[00:00:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:06] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5016 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:06] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:08] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[00:00:14] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:18] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:30] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:30] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days)
https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:34] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3051 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:34] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3056 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:36] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3057 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:38] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6007 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:38] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5009 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:40] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:00:42] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6010 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:42] (CR) Cwhite: opensearch: make upgrade-phatality.sh stricter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: Hashar)
[00:00:44] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6016 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days)
https://wikitech.wikimedia.org/wiki/HTTPS
[00:00:58] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6015 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:02] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5003 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:06] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3064 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:06] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3054 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:06] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3055 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:06] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6001 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:12] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3050 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:14] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6013 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:16] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3058 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10
23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:20] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3060 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:20] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3063 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:22] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6005 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:28] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6012 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:30] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:34] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5007 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:34] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:36] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3061 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:38] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6008 is CRITICAL: SSL CRITICAL - Certificate
*.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:44] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6002 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:50] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:50] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:54] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6004 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:56] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6003 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:58] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp5008 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:58] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3062 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:01:59] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6006 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:02] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3065 is
CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3052 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6014 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp6011 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:04] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3059 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:02:06] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp3053 is CRITICAL: SSL CRITICAL - Certificate *.wikipedia.org (RSA) valid until 2022-11-10 23:59:59 +0000 (expires in 14 days) https://wikitech.wikimedia.org/wiki/HTTPS
[00:04:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[00:04:42] RECOVERY - MariaDB Replica Lag: s3 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:05:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason:
host reimage
[00:06:52] RECOVERY - MariaDB Replica Lag: s8 on db1154 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:07:06] RECOVERY - MariaDB Replica Lag: s8 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:09:23] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[00:14:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[00:18:51] RECOVERY - Check systemd state on kubernetes1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[00:33:19] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS buster
[00:40:49] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:54:23] (PS1)
Ssingh: cp4043: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/849715 (https://phabricator.wikimedia.org/T317244)
[00:57:46] (CR) Ssingh: [C: +2] cp4043: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/849715 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)
[00:59:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS buster
[00:59:23] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4043.ulsfo.wmnet with OS buster
[01:03:18] PROBLEM - Check systemd state on ms-be2049 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:10:05] (PS3) BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996)
[01:10:54] (CR) BCornwall: varnish: Conditionally set WMF-Last-Access cookie (3 comments) [puppet] - https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: BCornwall)
[01:11:38] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4043.ulsfo.wmnet with OS buster
[01:11:45] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4043.ulsfo.wmnet with OS buster executed with errors: - cp4043 (**FA...
[01:15:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4043.ulsfo.wmnet with OS buster
[01:16:06] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp4043.ulsfo.wmnet with OS buster
[01:23:48] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:25:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:25:48] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:28:34] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:31:36] RECOVERY - Check systemd state on ms-be2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:33:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[01:33:54] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:35:05] (PS1) Tim Starling: In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/849670 (https://phabricator.wikimedia.org/T292552)
[01:35:57] (PS1) Tim
Starling: In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/849671 (https://phabricator.wikimedia.org/T292552)
[01:36:19] (CR) Tim Starling: [C: +2] In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/849670 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:36:24] (CR) Tim Starling: [C: +2] In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/849671 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:36:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:36:52] RECOVERY - MariaDB Replica Lag: s1 on db1154 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:37:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[01:38:59] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:41:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:57] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage
[01:45:24] !log sukhe@cumin2002 END (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4043.ulsfo.wmnet with reason: host reimage
[01:46:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:52:12] (Merged) jenkins-bot: In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.6) - https://gerrit.wikimedia.org/r/849670 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:53:12] (Merged) jenkins-bot: In Language::ucfirst(), use title case instead of upper case [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/849671 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[01:56:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[01:57:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[01:57:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[01:58:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:00:05] (PS1) Tim Starling: Temporary identity mappings for title case ligatures [mediawiki-config] - https://gerrit.wikimedia.org/r/849724 (https://phabricator.wikimedia.org/T292552)
[02:00:49] (PS25) Andrew Bogott: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[02:02:29] PROBLEM - Check systemd state on mx2001 is
CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:03:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:03:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:03:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:04:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:06:45] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:51] !log tstarling@deploy1002 Synchronized php-1.40.0-wmf.6/includes/language/Language.php: T292552 (duration: 03m 40s)
[02:06:56] T292552: Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552
[02:10:30] !log tstarling@deploy1002 Synchronized php-1.40.0-wmf.7/includes/language/Language.php: T292552 (duration: 03m 39s)
[02:11:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:09] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4043.ulsfo.wmnet with OS buster
[02:13:17] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp4043.ulsfo.wmnet with OS buster completed: - cp4043 (**PASS**) - R...
[02:15:07] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:15:15] (CR) Andrew Bogott: [C: -1] "One wrinkle that I think your code isn't considering: the PAWS and Tools will have different NFS servers. Each can run a copy of the same " [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe)
[02:25:38] (CR) Tim Starling: [C: +2] Temporary identity mappings for title case ligatures [mediawiki-config] - https://gerrit.wikimedia.org/r/849724 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[02:26:20] (Merged) jenkins-bot: Temporary identity mappings for title case ligatures [mediawiki-config] - https://gerrit.wikimedia.org/r/849724 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[02:29:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:30:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:30:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:30:57] !log tstarling@deploy1002 Synchronized wmf-config/UcfirstOverrides.php: T292552 allow title case ligatures (duration: 03m 36s)
[02:31:03] T292552: Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552
[02:31:14] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:36:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:37:58] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 14.30 ms
[02:38:10] PROBLEM - Check unit status of httpbb_hourly_appserver on
cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:40:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:40:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:41:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:45:03] (PS5) Tim Starling: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552)
[02:46:56] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient
[02:49:44] (CR) Tim Starling: [C: +2] Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[02:50:27] (Merged) jenkins-bot: Migrate to PHP 7.4 title case mapping, but retain Eszett override [mediawiki-config] - https://gerrit.wikimedia.org/r/842243 (https://phabricator.wikimedia.org/T292552) (owner: Tim Starling)
[02:53:14] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:56:11] !log tstarling@deploy1002 Synchronized wmf-config/UcfirstOverrides.php: T292552 final configuration
(duration: 03m 54s)
[02:56:17] T292552: Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552
[02:56:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:57:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:57:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:58:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:59:10] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:29:32] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:02] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:38] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[03:47:52] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported
[04:04:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly -
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:14:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:31:01] (PS9) Giuseppe Lavagetto: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949
[04:31:10] (CR) Giuseppe Lavagetto: [C: +2] Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[04:34:49] (Merged) jenkins-bot: Add cookbook to restart pybal [cookbooks] - https://gerrit.wikimedia.org/r/848949 (owner: Giuseppe Lavagetto)
[04:46:44] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:52:44] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.87 ms
[04:53:38] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:42] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:01:08] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[05:07:14] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms
[05:09:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly -
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:20:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T321178
[05:20:56] T321178: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T321178
[05:21:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T321178
[05:21:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1130 with weight 0 T321178', diff saved to https://phabricator.wikimedia.org/P36636 and previous config saved to /var/cache/conftool/dbconfig/20221027-052127-ladsgroup.json
[05:22:52] (CR) Sohom Datta: [C: +1] "Will schedule this for 6:30-7:30 (IST) today" [mediawiki-config] - https://gerrit.wikimedia.org/r/849672 (https://phabricator.wikimedia.org/T53980) (owner: Bodhisattwa)
[05:24:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:24:22] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:26:34] (PS4) DDesouza: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333)
[05:28:25] !log dbmaint Switch x1 to SBR T318518
[05:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:31] T318518: Add `gemm_mentee_is_active` column to growthexperiments_mentor_mentee x1 table
- https://phabricator.wikimedia.org/T318518 [05:30:30] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:30:52] 10SRE, 10ops-eqiad, 10DBA: db1154 is not coming back after restart - https://phabricator.wikimedia.org/T321562 (10Marostegui) Thanks a lot John [05:35:08] !log Deploy schema change on x1 T318518 [05:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:35:14] T318518: Add `gemm_mentee_is_active` column to growthexperiments_mentor_mentee x1 table - https://phabricator.wikimedia.org/T318518 [05:39:14] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:44:06] (03PS1) 10Marostegui: Revert "clouddb1013, clouddb1017: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/849673 [05:44:57] (03CR) 10Marostegui: [C: 03+2] Revert "clouddb1013, clouddb1017: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/849673 (owner: 10Marostegui) [05:45:02] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 132 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:46:10] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [05:46:30] (03PS2) 10Ladsgroup: mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/844017 (https://phabricator.wikimedia.org/T321178) (owner: 10Gerrit maintenance bot) [05:46:35] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1130 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/844017 (https://phabricator.wikimedia.org/T321178) (owner: 
10Gerrit maintenance bot) [05:47:02] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:51:58] PROBLEM - SSH on mw1307.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:52:17] (03PS1) 10Marostegui: Revert "mariadb: Switch x1 to STATEMENT" [puppet] - 10https://gerrit.wikimedia.org/r/849674 [05:52:18] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.62 ms [05:52:58] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Switch x1 to STATEMENT" [puppet] - 10https://gerrit.wikimedia.org/r/849674 (owner: 10Marostegui) [05:55:19] (03CR) 10JMeybohm: [C: 03+1] coredns: add standard labels to resources (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [05:57:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:57:04] (03CR) 10JMeybohm: [C: 03+2] kubernetes: Actually use the master_fqdn instead of the cert name [puppet] - 10https://gerrit.wikimedia.org/r/849543 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [06:00:05] kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T0600). 
[06:00:22] let's go [06:00:28] yep [06:00:36] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [06:00:55] !log Starting s5 eqiad failover from db1100 to db1130 - T321178 [06:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:01] T321178: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T321178 [06:01:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T321178', diff saved to https://phabricator.wikimedia.org/P36637 and previous config saved to /var/cache/conftool/dbconfig/20221027-060102-ladsgroup.json [06:01:30] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [06:01:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1130 to s5 primary and set section read-write T321178', diff saved to https://phabricator.wikimedia.org/P36638 and previous config saved to /var/cache/conftool/dbconfig/20221027-060137-ladsgroup.json [06:02:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:05:23] (03PS2) 10Ladsgroup: wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/844018 (https://phabricator.wikimedia.org/T321178) (owner: 10Gerrit maintenance bot) [06:05:40] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s5-master alias [dns] - 10https://gerrit.wikimedia.org/r/844018 (https://phabricator.wikimedia.org/T321178) (owner: 10Gerrit maintenance bot) [06:06:44] RECOVERY - Host parse1001.mgmt is UP: PING WARNING - Packet loss = 75%, RTA = 1.82 ms [06:06:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1100 T321178', diff saved to 
https://phabricator.wikimedia.org/P36639 and previous config saved to /var/cache/conftool/dbconfig/20221027-060654-ladsgroup.json [06:07:00] T321178: Switchover s5 master (db1100 -> db1130) - https://phabricator.wikimedia.org/T321178 [06:07:39] (03PS1) 10Marostegui: add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) [06:08:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:08:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:11:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:23:39] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:27:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:27:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:27:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:28:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:28:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance 
[06:28:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [06:28:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T318950)', diff saved to https://phabricator.wikimedia.org/P36640 and previous config saved to /var/cache/conftool/dbconfig/20221027-062836-ladsgroup.json [06:29:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:29:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1100.eqiad.wmnet with reason: Maintenance [06:29:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [06:29:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [06:29:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:30:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [06:30:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T318950)', diff saved to https://phabricator.wikimedia.org/P36641 and previous config saved to /var/cache/conftool/dbconfig/20221027-063018-ladsgroup.json [06:34:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T318950)', diff saved to https://phabricator.wikimedia.org/P36642 and previous config saved to /var/cache/conftool/dbconfig/20221027-063414-ladsgroup.json [06:36:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318950)', diff saved to https://phabricator.wikimedia.org/P36643 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-063631-ladsgroup.json [06:36:38] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [06:45:02] (03PS26) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [06:45:43] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [06:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [06:48:46] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P36644 and previous config saved to /var/cache/conftool/dbconfig/20221027-064921-ladsgroup.json [06:49:25] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1009.eqiad.wmnet with reason: Remove from cluster for eventual reimage [06:49:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1009.eqiad.wmnet with reason: Remove from cluster for eventual reimage [06:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P36645 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-065138-ladsgroup.json [06:52:42] RECOVERY - SSH on mw1307.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:52:55] (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [06:53:52] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1009.eqiad.wmnet with OS bullseye [06:55:03] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS bullseye [06:55:27] (03PS1) 10David Caro: openstack.add_flavor: create cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) [06:57:06] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:59:08] (03CR) 10CI reject: [V: 04-1] openstack.add_flavor: create cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: 10David Caro) [07:00:04] Amir1, apergos, and jnuche: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T0700). [07:00:13] morning! there are two trainees signed up this morning but no patches in the window. 
I guess I'll give them the links for the docs and say a few words about how much simpler the deployment process is now with the new scap backport command, and then see if they want to reschedule, heh. [07:03:16] I'm joining too [07:04:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P36646 and previous config saved to /var/cache/conftool/dbconfig/20221027-070427-ladsgroup.json [07:05:16] (03PS2) 10David Caro: openstack.add_flavor: create cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) [07:06:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P36647 and previous config saved to /var/cache/conftool/dbconfig/20221027-070644-ladsgroup.json [07:08:41] (03CR) 10CI reject: [V: 04-1] openstack.add_flavor: create cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: 10David Caro) [07:09:18] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1009.eqiad.wmnet with reason: host reimage [07:12:29] (03PS27) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [07:12:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1009.eqiad.wmnet with reason: host reimage [07:12:54] (03CR) 10CI reject: [V: 04-1] wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [07:14:47] (03PS3) 10David Caro: openstack.add_flavor: create cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) [07:15:12] (03CR) 10Raymond 
Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [07:17:35] thanks everybody, see you all next time! [07:19:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T318950)', diff saved to https://phabricator.wikimedia.org/P36648 and previous config saved to /var/cache/conftool/dbconfig/20221027-071934-ladsgroup.json [07:19:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [07:19:40] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [07:19:47] apergos: ty for your clarifications and directions! [07:19:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [07:20:05] sure thing sergi0_ thanks for showing up! 
[07:21:11] (03PS28) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [07:21:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:21:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [07:21:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T318950)', diff saved to https://phabricator.wikimedia.org/P36649 and previous config saved to /var/cache/conftool/dbconfig/20221027-072148-ladsgroup.json [07:21:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T318950)', diff saved to https://phabricator.wikimedia.org/P36650 and previous config saved to /var/cache/conftool/dbconfig/20221027-072157-ladsgroup.json [07:21:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [07:22:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [07:22:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T318950)', diff saved to https://phabricator.wikimedia.org/P36651 and previous config saved to /var/cache/conftool/dbconfig/20221027-072219-ladsgroup.json [07:24:23] (03PS29) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) [07:24:41] (03CR) 10Raymond Ndibe: wmcs: changes to api service to manage toolforge replica.my.cnf (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [07:25:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
es2023.codfw.wmnet with reason: Maintenance [07:25:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2023.codfw.wmnet with reason: Maintenance [07:25:33] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10Joe) Hi, as stated in the email thread, I don't think this is a good course of action. `systemd::monitor` offers more functionality than the generic... [07:25:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2023 (T321312)', diff saved to https://phabricator.wikimedia.org/P36652 and previous config saved to /var/cache/conftool/dbconfig/20221027-072536-ladsgroup.json [07:25:41] (03PS2) 10Giuseppe Lavagetto: monitoring: introduce exclude list for checking systemd units [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) [07:25:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T318950)', diff saved to https://phabricator.wikimedia.org/P36653 and previous config saved to /var/cache/conftool/dbconfig/20221027-072543-ladsgroup.json [07:25:45] (03PS2) 10Giuseppe Lavagetto: check_systemd_state: consume exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) [07:25:49] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:25:49] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [07:26:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:27:32] (03CR) 10CI reject: [V: 04-1] monitoring: introduce exclude list for checking systemd units [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [07:27:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1009.eqiad.wmnet with OS bullseye [07:27:50] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1009.eqiad.wmnet with OS bullseye completed: - ganeti1009 (**PASS**) - Downtimed on... [07:27:53] (03CR) 10CI reject: [V: 04-1] check_systemd_state: consume exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [07:28:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:30:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318950)', diff saved to https://phabricator.wikimedia.org/P36654 and previous config saved to /var/cache/conftool/dbconfig/20221027-073014-ladsgroup.json [07:32:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet [07:36:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023 (T321312)', diff saved to https://phabricator.wikimedia.org/P36655 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-073612-ladsgroup.json [07:38:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1005.eqiad.wmnet to drbd [07:39:35] !log restarting blazegraph on wdqs1016 (BlazegraphFreeAllocatorsDecreasingRapidly) [07:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet [07:40:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P36656 and previous config saved to /var/cache/conftool/dbconfig/20221027-074050-ladsgroup.json [07:43:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:44:45] (03PS3) 10Giuseppe Lavagetto: monitoring: introduce exclude list for checking systemd units [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) [07:44:47] (03PS3) 10Giuseppe Lavagetto: check_systemd_state: consume exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) [07:45:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P36657 and previous config saved to /var/cache/conftool/dbconfig/20221027-074521-ladsgroup.json [07:46:25] (03CR) 10CI reject: [V: 04-1] monitoring: introduce exclude list for checking systemd units [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [07:46:48] 
(03CR) 10CI reject: [V: 04-1] check_systemd_state: consume exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [07:47:43] (03PS1) 10David Caro: puppet_enc: use the repo-wide line length and fix profile [puppet] - 10https://gerrit.wikimedia.org/r/850005 [07:48:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1005.eqiad.wmnet to drbd [07:48:26] PROBLEM - Host kubestagetcd1005 is DOWN: PING CRITICAL - Packet loss = 100% [07:48:55] ^ can be ignored, monitoring glitch [07:49:06] RECOVERY - Host kubestagetcd1005 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [07:49:58] (03CR) 10CI reject: [V: 04-1] puppet_enc: use the repo-wide line length and fix profile [puppet] - 10https://gerrit.wikimedia.org/r/850005 (owner: 10David Caro) [07:50:59] (03PS23) 10Slyngshede: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [07:51:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023', diff saved to https://phabricator.wikimedia.org/P36658 and previous config saved to /var/cache/conftool/dbconfig/20221027-075118-ladsgroup.json [07:51:34] (03CR) 10CI reject: [V: 04-1] role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [07:53:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [07:53:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance [07:53:26] (03PS2) 10David Caro: puppet_enc: use the repo-wide line length and fix profile [puppet] - 10https://gerrit.wikimedia.org/r/850005 [07:53:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling 
db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P36659 and previous config saved to /var/cache/conftool/dbconfig/20221027-075327-ladsgroup.json [07:53:33] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [07:53:39] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk type of kubestagetcd1005.eqiad.wmnet to plain [07:53:45] (03PS24) 10Slyngshede: role::idm Basic deployment of IDM [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [07:54:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance [07:54:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1024.eqiad.wmnet with reason: Maintenance [07:54:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1024 (T321312)', diff saved to https://phabricator.wikimedia.org/P36660 and previous config saved to /var/cache/conftool/dbconfig/20221027-075433-ladsgroup.json [07:54:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:55:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of kubestagetcd1005.eqiad.wmnet to plain [07:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P36661 and previous config saved to /var/cache/conftool/dbconfig/20221027-075556-ladsgroup.json [07:59:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [08:00:04] jnuche and hashar: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T0800). [08:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [08:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P36662 and previous config saved to /var/cache/conftool/dbconfig/20221027-080027-ladsgroup.json [08:02:46] (03CR) 10Slyngshede: role::idm Basic deployment of IDM (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [08:02:57] (03CR) 10Slyngshede: role::idm Basic deployment of IDM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [08:03:45] (03CR) 10Elukey: "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [08:05:49] (03PS1) 10Marostegui: mariadb: Promote pc1014 to pc3 master [puppet] - 10https://gerrit.wikimedia.org/r/850027 [08:06:22] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850028 (https://phabricator.wikimedia.org/T320512) [08:06:24] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850028 (https://phabricator.wikimedia.org/T320512) (owner: 10TrainBranchBot) [08:06:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023', diff saved to https://phabricator.wikimedia.org/P36663 and previous config saved to /var/cache/conftool/dbconfig/20221027-080625-ladsgroup.json [08:06:51] (03PS1) 10Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850029 [08:07:13] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850028 (https://phabricator.wikimedia.org/T320512) (owner: 10TrainBranchBot) [08:08:25] (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850029 (owner: 10Marostegui) [08:08:48] jnuche: Could you let me know once you've finished with the train? [08:09:16] marostegui: sure, will do [08:09:22] thank you [08:09:51] (03CR) 10Muehlenhoff: "Looks good, one final nit." 
[puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [08:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T318950)', diff saved to https://phabricator.wikimedia.org/P36664 and previous config saved to /var/cache/conftool/dbconfig/20221027-081103-ladsgroup.json [08:11:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:11:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [08:11:10] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [08:11:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P36665 and previous config saved to /var/cache/conftool/dbconfig/20221027-081113-ladsgroup.json [08:11:24] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.7 refs T320512 [08:11:29] T320512: 1.40.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T320512 [08:11:45] (03CR) 10Ladsgroup: [C: 03+1] "This really should be automated :sob:" [puppet] - 10https://gerrit.wikimedia.org/r/850027 (owner: 10Marostegui) [08:13:17] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:13:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T321312)', diff saved to https://phabricator.wikimedia.org/P36666 and previous config saved to /var/cache/conftool/dbconfig/20221027-081339-ladsgroup.json [08:14:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:14:18] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:15:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [08:15:16] marostegui: deployment is complete [08:15:23] thank you! [08:15:27] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote pc1014 to pc3 master [puppet] - 10https://gerrit.wikimedia.org/r/850027 (owner: 10Marostegui) [08:15:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T318950)', diff saved to https://phabricator.wikimedia.org/P36667 and previous config saved to /var/cache/conftool/dbconfig/20221027-081534-ladsgroup.json [08:15:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [08:15:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [08:16:17] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850029 (owner: 10Marostegui) [08:16:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1009.eqiad.wmnet to cluster eqiad and group C [08:16:58] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850029 (owner: 10Marostegui) [08:17:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850029 (owner: 10Marostegui) [08:17:30] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:850029|ProductionServices.php: Promote pc1014 to pc3 master]] [08:17:31] Amir1: ^ \o/ [08:17:36] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1009.eqiad.wmnet to cluster eqiad and group C [08:17:45] wohooo [08:17:49] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:850029|ProductionServices.php: Promote pc1014 to pc3 master]] synced to the testservers: mwdebug2001.codfw.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [08:17:51] this thing is awesome [08:18:10] (03CR) 10Ladsgroup: [C: 04-1] add_cuc_user_ip_time_index_T321123.py: New schema change (035 comments) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: 10Marostegui) [08:18:40] oh schema changes show up here too. nice [08:19:15] !log upload vim python3-stdlib-extensions to buster component/python39 [08:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:14] !log powercycle elastic2043 - no mgmt console tty available, not responsive to ssh, memory/dimm errors in `racadm getsel` [08:20:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:20:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] puppet_enc: use the repo-wide line length and fix profile [puppet] - 10https://gerrit.wikimedia.org/r/850005 (owner: 10David Caro) [08:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:21] (03CR) 10Jaime Nuche: [C: 03+1] opensearch: make upgrade-phatality.sh stricter [puppet] - 10https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: 10Hashar) [08:20:59] (03PS2) 10Marostegui: add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) [08:21:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:21:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:21:22] (03CR) 10Marostegui: add_cuc_user_ip_time_index_T321123.py: New schema change (035 comments) [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: 10Marostegui) [08:21:30] 10ops-codfw, 10Discovery-Search: elastic2043 reported memory errors
- https://phabricator.wikimedia.org/T321771 (10elukey) [08:21:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2023 (T321312)', diff saved to https://phabricator.wikimedia.org/P36668 and previous config saved to /var/cache/conftool/dbconfig/20221027-082131-ladsgroup.json [08:21:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance [08:21:47] (03CR) 10Clément Goubert: monitoring: introduce exclude list for checking systemd units (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [08:21:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2025.codfw.wmnet with reason: Maintenance [08:21:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [08:21:52] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:850029|ProductionServices.php: Promote pc1014 to pc3 master]] (duration: 04m 22s) [08:21:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36669 and previous config saved to /var/cache/conftool/dbconfig/20221027-082157-ladsgroup.json [08:21:58] (03CR) 10CI reject: [V: 04-1] add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: 10Marostegui) [08:22:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [08:22:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T318950)', diff saved to https://phabricator.wikimedia.org/P36670 and previous config saved to /var/cache/conftool/dbconfig/20221027-082211-ladsgroup.json [08:22:16] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:22:17] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [08:22:20] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.73 ms [08:23:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack.add_flavor: create cookbook (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: 10David Caro) [08:23:43] (03PS3) 10Marostegui: add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) [08:24:12] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:09] (03CR) 10Elukey: [C: 03+2] coredns: add standard labels to resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/849562 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [08:26:16] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:27:00] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/849494 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [08:27:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P36671 and previous config saved to /var/cache/conftool/dbconfig/20221027-082707-ladsgroup.json [08:27:14] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [08:28:38] (03PS25) 10Slyngshede: role::idm Basic deployment of IDM [puppet] - 
10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) [08:28:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P36672 and previous config saved to /var/cache/conftool/dbconfig/20221027-082846-ladsgroup.json [08:29:14] (03CR) 10Ladsgroup: [C: 03+1] add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: 10Marostegui) [08:30:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318950)', diff saved to https://phabricator.wikimedia.org/P36673 and previous config saved to /var/cache/conftool/dbconfig/20221027-083017-ladsgroup.json [08:30:24] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [08:30:27] (03CR) 10David Caro: reprepro: add kubeadm-k8s-1-21/22 bullseye suite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [08:31:07] (03CR) 10Slyngshede: role::idm Basic deployment of IDM (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [08:32:40] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [08:32:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [08:34:28] 10SRE, 10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Clement_Goubert) Settling on `mw-web` as there's been no contrary opinion in a week. 
[08:35:52] (03CR) 10David Caro: [C: 03+2] openstack.add_flavor: create cookbook (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: 10David Caro) [08:36:16] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [08:36:19] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [08:36:24] (03CR) 10Marostegui: [C: 03+2] add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: 10Marostegui) [08:37:04] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [08:37:06] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [08:37:49] (03CR) 10Kosta Harlan: [C: 03+2] [labs] GrowthExperiments: Beta wikis to use NewImpact module by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844046 (https://phabricator.wikimedia.org/T311299) (owner: 10Kosta Harlan) [08:38:33] (03Merged) 10jenkins-bot: [labs] GrowthExperiments: Beta wikis to use NewImpact module by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844046 (https://phabricator.wikimedia.org/T311299) (owner: 10Kosta Harlan) [08:38:39] (03CR) 10David Caro: [C: 03+2] p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) (owner: 10David Caro) [08:38:56] (03PS8) 10David Caro: p::metricsinfra:haproxy: Allow exposing federation endpoints [puppet] - 10https://gerrit.wikimedia.org/r/829746 (https://phabricator.wikimedia.org/T313031) [08:38:59] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:39:04] (03Merged) 10jenkins-bot: openstack.add_flavor: 
create cookbook [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/849957 (https://phabricator.wikimedia.org/T321657) (owner: 10David Caro) [08:39:06] (03Merged) 10jenkins-bot: add_cuc_user_ip_time_index_T321123.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/849915 (https://phabricator.wikimedia.org/T321123) (owner: 10Marostegui) [08:40:48] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:41:26] RECOVERY - Check systemd state on ml-serve-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36674 and previous config saved to /var/cache/conftool/dbconfig/20221027-084214-ladsgroup.json [08:42:28] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:42:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:43:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:43:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:43:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024', diff saved to https://phabricator.wikimedia.org/P36675 and previous config saved to /var/cache/conftool/dbconfig/20221027-084352-ladsgroup.json [08:43:59] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:44:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mwdebug: apply [08:45:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P36676 and previous config saved to /var/cache/conftool/dbconfig/20221027-084523-ladsgroup.json [08:46:53] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850064 [08:47:09] (03PS1) 10Marostegui: Revert "mariadb: Promote pc1014 to pc3 master" [puppet] - 10https://gerrit.wikimedia.org/r/850065 [08:47:45] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850064 (owner: 10Marostegui) [08:47:48] (03CR) 10Muehlenhoff: [C: 03+2] Set profile::contacts::role_contacts for gitlab runners [puppet] - 10https://gerrit.wikimedia.org/r/849063 (owner: 10Muehlenhoff) [08:48:14] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Promote pc1014 to pc3 master" [puppet] - 10https://gerrit.wikimedia.org/r/850065 (owner: 10Marostegui) [08:48:40] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850064 (owner: 10Marostegui) [08:48:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850064 (owner: 10Marostegui) [08:49:48] 10SRE, 10SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (10mfossati) [08:50:37] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:850064|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] [08:50:56] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:850064|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [08:51:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/842753 (https://phabricator.wikimedia.org/T320428) (owner: 10Slyngshede) [08:52:41] (03PS1) 10Marostegui: pc1014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/850037 [08:53:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10fgiunchedi) An update based on the feedback received by SREs: individual alerts for each `::job` are considered useful because the alerts can be dow... [08:54:06] (03CR) 10WMDE-Fisch: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: 10WMDE-Fisch) [08:54:23] 10SRE, 10SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (10mfossati) Noting that I completed a deployment training session, see {T302204}. It will be useful for the next one to have deployment access, see {T313812}. @thcipriani : not sure a... 
[08:54:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:54:38] PROBLEM - SSH on mw1338.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:55:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [08:55:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:55:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:55:30] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:850064|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] (duration: 04m 52s) [08:55:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2114.codfw.wmnet with reason: Maintenance [08:56:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance [08:56:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2117.codfw.wmnet with reason: Maintenance [08:56:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T321123)', diff saved to https://phabricator.wikimedia.org/P36678 and previous config saved to /var/cache/conftool/dbconfig/20221027-085617-marostegui.json [08:56:18] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [08:56:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:56:23] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [08:56:56] (03PS3) 10WMDE-Fisch: Enable show nearby feature on de.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) [08:57:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36679 and previous config saved to /var/cache/conftool/dbconfig/20221027-085720-ladsgroup.json [08:57:31] !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [08:57:36] !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [08:57:55] (03PS4) 10Giuseppe Lavagetto: monitoring: introduce exclude list for checking systemd units [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) [08:57:57] (03PS4) 10Giuseppe Lavagetto: check_systemd_state: consume exclusion list [puppet] - 10https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) [08:58:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321123)', diff saved to https://phabricator.wikimedia.org/P36680 and previous config saved to /var/cache/conftool/dbconfig/20221027-085829-marostegui.json [08:58:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1024 (T321312)', diff saved to https://phabricator.wikimedia.org/P36681 and previous config saved to /var/cache/conftool/dbconfig/20221027-085859-ladsgroup.json [08:59:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance [08:59:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1025.eqiad.wmnet with reason: Maintenance [08:59:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36682 and previous config saved to /var/cache/conftool/dbconfig/20221027-085934-ladsgroup.json [08:59:38] (03PS1) 10Jbond: tox.ini: drop support for python3.7/3.8 [cookbooks] - 10https://gerrit.wikimedia.org/r/850038 [09:00:00] (03CR) 10Marostegui: [C: 03+2] pc1014: Move it to pc2 [puppet] - 
10https://gerrit.wikimedia.org/r/850037 (owner: 10Marostegui) [09:00:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P36683 and previous config saved to /var/cache/conftool/dbconfig/20221027-090030-ladsgroup.json [09:00:53] (03CR) 10CI reject: [V: 04-1] tox.ini: drop support for python3.7/3.8 [cookbooks] - 10https://gerrit.wikimedia.org/r/850038 (owner: 10Jbond) [09:01:21] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.41 ms [09:08:53] (03Abandoned) 10Filippo Giunchedi: dns: generate HOST.mgmt records in all statuses [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/849495 (https://phabricator.wikimedia.org/T320721) (owner: 10Filippo Giunchedi) [09:09:34] !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [09:09:36] !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [09:10:03] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [09:10:05] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [09:10:32] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/admin 'sync'. [09:10:37] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'sync'. 
[09:11:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P36684 and previous config saved to /var/cache/conftool/dbconfig/20221027-091130-ladsgroup.json [09:11:36] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [09:12:09] (03PS1) 10Marostegui: ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850040 [09:12:15] (03CR) 10Vgutierrez: [C: 04-2] "Let's get rid of the trafficserver9 component (after copying the packages to main of course)" [puppet] - 10https://gerrit.wikimedia.org/r/849640 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [09:12:25] PROBLEM - MegaRAID on an-worker1083 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:12:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T318605)', diff saved to https://phabricator.wikimedia.org/P36685 and previous config saved to /var/cache/conftool/dbconfig/20221027-091227-ladsgroup.json [09:12:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [09:12:33] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:12:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [09:12:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P36686 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-091249-ladsgroup.json [09:13:00] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [09:13:02] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [09:13:22] (03PS1) 10Marostegui: mariadb: Promote pc2014 to pc1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/850041 [09:13:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36687 and previous config saved to /var/cache/conftool/dbconfig/20221027-091336-marostegui.json [09:14:28] 10SRE, 10Traffic: Clean up after ATS 9.x upgrade - https://phabricator.wikimedia.org/T321776 (10Vgutierrez) [09:14:42] (03CR) 10Filippo Giunchedi: [C: 03+2] confd: don't backup tidied files [puppet] - 10https://gerrit.wikimedia.org/r/849600 (https://phabricator.wikimedia.org/T321678) (owner: 10Filippo Giunchedi) [09:14:52] 10SRE, 10Traffic: Clean up after ATS 9.x upgrade - https://phabricator.wikimedia.org/T321776 (10Vgutierrez) p:05Triage→03Medium [09:15:12] (03PS2) 10Marostegui: ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850040 [09:15:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T318950)', diff saved to https://phabricator.wikimedia.org/P36688 and previous config saved to /var/cache/conftool/dbconfig/20221027-091536-ladsgroup.json [09:15:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [09:15:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [09:15:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [09:15:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 
db2094.codfw.wmnet with reason: Maintenance [09:16:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T318950)', diff saved to https://phabricator.wikimedia.org/P36689 and previous config saved to /var/cache/conftool/dbconfig/20221027-091603-ladsgroup.json [09:17:39] !log failover ganeti master in ulsfo to ganeti4008, unblocking future decom of ganeti4003 T317247 [09:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:45] T317247: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 [09:17:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] global: replace labsproject by wmcs_project (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849473 (owner: 10David Caro) [09:19:18] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Clean up stale/old confd errors automatically - https://phabricator.wikimedia.org/T321678 (10fgiunchedi) 05Open→03Resolved This is done! 
[09:19:25] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q1), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [09:20:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36691 and previous config saved to /var/cache/conftool/dbconfig/20221027-092028-ladsgroup.json [09:21:13] PROBLEM - ganeti-wconfd running on ganeti4003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [09:23:13] RECOVERY - MegaRAID on an-worker1083 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:23:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318950)', diff saved to https://phabricator.wikimedia.org/P36692 and previous config saved to /var/cache/conftool/dbconfig/20221027-092355-ladsgroup.json [09:24:02] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [09:24:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb-test2001.codfw.wmnet [09:26:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P36693 and previous config saved to /var/cache/conftool/dbconfig/20221027-092636-ladsgroup.json [09:26:45] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P36694 and previous config saved to /var/cache/conftool/dbconfig/20221027-092842-marostegui.json 
[09:29:56] PROBLEM - Check systemd state on mw2334 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36696 and previous config saved to /var/cache/conftool/dbconfig/20221027-093250-ladsgroup.json [09:34:21] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb-test2001.codfw.wmnet [09:35:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2117', diff saved to https://phabricator.wikimedia.org/P36697 and previous config saved to /var/cache/conftool/dbconfig/20221027-093519-marostegui.json [09:35:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025', diff saved to https://phabricator.wikimedia.org/P36698 and previous config saved to /var/cache/conftool/dbconfig/20221027-093534-ladsgroup.json [09:37:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [09:37:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2097.codfw.wmnet with reason: Maintenance [09:37:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [09:37:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2104.codfw.wmnet with reason: Maintenance [09:38:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T321123)', diff saved to https://phabricator.wikimedia.org/P36699 and previous config saved to /var/cache/conftool/dbconfig/20221027-093804-marostegui.json [09:38:10] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [09:38:38] !log 
jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw [09:39:01] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kubernetes: Rename mwdebug to mw-debug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [09:39:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P36700 and previous config saved to /var/cache/conftool/dbconfig/20221027-093902-ladsgroup.json [09:39:25] (03CR) 10David Caro: wmcs: changes to api service to manage toolforge replica.my.cnf (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/810965 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [09:40:14] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote pc2014 to pc1 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/850041 (owner: 10Marostegui) [09:40:19] (03CR) 10Ladsgroup: [C: 03+1] ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850040 (owner: 10Marostegui) [09:40:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321123)', diff saved to https://phabricator.wikimedia.org/P36701 and previous config saved to /var/cache/conftool/dbconfig/20221027-094030-marostegui.json [09:40:35] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850040 (owner: 10Marostegui) [09:41:25] (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc2014 to pc1 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850040 (owner: 10Marostegui) [09:41:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850040 (owner: 10Marostegui) [09:41:42] !log marostegui@deploy1002 Started scap: 
Backport for [[gerrit:850040|ProductionServices.php: Promote pc2014 to pc1 codfw master]] [09:41:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P36702 and previous config saved to /var/cache/conftool/dbconfig/20221027-094143-ladsgroup.json [09:41:51] (03PS3) 10AikoChou: ml-services: add revert-risk-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/849627 (https://phabricator.wikimedia.org/T321594) [09:42:01] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:850040|ProductionServices.php: Promote pc2014 to pc1 codfw master]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [09:46:08] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:850040|ProductionServices.php: Promote pc2014 to pc1 codfw master]] (duration: 04m 26s) [09:46:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw [09:46:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P36703 and previous config saved to /var/cache/conftool/dbconfig/20221027-094655-ladsgroup.json [09:47:01] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [09:47:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:47:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P36704 and previous config saved to /var/cache/conftool/dbconfig/20221027-094756-ladsgroup.json [09:48:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:48:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply 
[09:48:09] (03CR) 10Jbond: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/850038 (owner: 10Jbond) [09:49:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025', diff saved to https://phabricator.wikimedia.org/P36705 and previous config saved to /var/cache/conftool/dbconfig/20221027-095041-ladsgroup.json [09:52:06] (03CR) 10Elukey: [C: 03+2] ml-services: add revert-risk-model isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/849627 (https://phabricator.wikimedia.org/T321594) (owner: 10AikoChou) [09:52:11] (03PS1) 10Marostegui: Revert "mariadb: Promote pc2014 to pc1 codfw master" [puppet] - 10https://gerrit.wikimedia.org/r/850069 [09:52:23] (03PS1) 10Vgutierrez: trafficserver: Clean up after ATS 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) [09:52:28] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850070 [09:53:01] (03CR) 10CI reject: [V: 04-1] trafficserver: Clean up after ATS 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) (owner: 10Vgutierrez) [09:53:50] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P36706 and previous config saved to /var/cache/conftool/dbconfig/20221027-095408-ladsgroup.json [09:54:40] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad [09:55:20] RECOVERY - SSH on mw1338.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:55:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P36707 and previous config saved to /var/cache/conftool/dbconfig/20221027-095537-marostegui.json [09:55:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host krb2002.codfw.wmnet [09:56:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T318950)', diff saved to https://phabricator.wikimedia.org/P36708 and previous config saved to /var/cache/conftool/dbconfig/20221027-095649-ladsgroup.json [09:56:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:56:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [09:56:55] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [09:56:59] (03CR) 10Jbond: doc: add parameters for gitlab_runner and contint hosts, allow them (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [09:57:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T318950)', diff saved to https://phabricator.wikimedia.org/P36709 and previous config saved to /var/cache/conftool/dbconfig/20221027-095700-ladsgroup.json [09:57:28] (03CR) 10Jbond: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/850038 (owner: 10Jbond) [10:00:04] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1000). 
[10:00:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host krb2002.codfw.wmnet [10:00:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T318950)', diff saved to https://phabricator.wikimedia.org/P36710 and previous config saved to /var/cache/conftool/dbconfig/20221027-100057-ladsgroup.json [10:02:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P36711 and previous config saved to /var/cache/conftool/dbconfig/20221027-100201-ladsgroup.json [10:02:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad [10:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025', diff saved to https://phabricator.wikimedia.org/P36712 and previous config saved to /var/cache/conftool/dbconfig/20221027-100303-ladsgroup.json [10:03:28] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet [10:03:39] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/842359 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [10:05:30] (03CR) 10Filippo Giunchedi: [C: 03+2] customscripts: exclude decommissioning hosts from mgmt data [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/842359 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [10:05:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36713 and previous config saved to /var/cache/conftool/dbconfig/20221027-100547-ladsgroup.json [10:06:44] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [10:08:03] (03PS2) 10Clément Goubert: mediawiki: Create 
new mw-debug deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) [10:08:05] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Promote pc2014 to pc1 codfw master" [puppet] - 10https://gerrit.wikimedia.org/r/850069 (owner: 10Marostegui) [10:08:16] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850070 (owner: 10Marostegui) [10:09:14] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850070 (owner: 10Marostegui) [10:09:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T318950)', diff saved to https://phabricator.wikimedia.org/P36714 and previous config saved to /var/cache/conftool/dbconfig/20221027-100915-ladsgroup.json [10:09:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [10:09:19] !log filippo@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [10:09:21] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [10:09:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850070 (owner: 10Marostegui) [10:09:31] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:850070|Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master"]] [10:09:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [10:09:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T318950)', diff saved to https://phabricator.wikimedia.org/P36715 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-100948-ladsgroup.json [10:09:50] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:850070|Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [10:10:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P36716 and previous config saved to /var/cache/conftool/dbconfig/20221027-101043-marostegui.json [10:11:32] (03PS3) 10Clément Goubert: kubernetes: Rename mwdebug to mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) [10:12:31] !log filippo@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync-mgmt - filippo@cumin1001" [10:12:43] (03CR) 10Clément Goubert: kubernetes: Rename mwdebug to mw-debug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [10:13:36] godog: \o/ [10:13:37] (03PS1) 10Marostegui: pc2014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/850090 [10:13:53] (03CR) 10CI reject: [V: 04-1] kubernetes: Rename mwdebug to mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [10:14:00] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:850070|Revert "ProductionServices.php: Promote pc2014 to pc1 codfw master"]] (duration: 04m 29s) [10:14:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host stat1009.eqiad.wmnet [10:14:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:15:01] PROBLEM - MegaRAID on an-worker1083 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:15:02] (03PS4) 10Clément Goubert: kubernetes: Rename mwdebug to mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) [10:15:12] (03CR) 10Marostegui: [C: 03+2] pc2014: Move it to pc2 [puppet] - 10https://gerrit.wikimedia.org/r/850090 (owner: 10Marostegui) [10:15:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [10:15:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [10:16:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P36717 and previous config saved to /var/cache/conftool/dbconfig/20221027-101604-ladsgroup.json [10:16:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [10:17:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P36718 and previous config saved to /var/cache/conftool/dbconfig/20221027-101708-ladsgroup.json [10:17:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318950)', diff saved to https://phabricator.wikimedia.org/P36719 and previous config saved to /var/cache/conftool/dbconfig/20221027-101742-ladsgroup.json [10:17:47] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [10:17:48] volans: \o/ indeed! hiera data updated [10:18:15] (03CR) 10Clément Goubert: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [10:18:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2030 as es1 master, es2031 as es2 master, es2029 as es3 master', diff saved to https://phabricator.wikimedia.org/P36720 and previous config saved to /var/cache/conftool/dbconfig/20221027-101842-marostegui.json [10:18:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1025 (T321312)', diff saved to https://phabricator.wikimedia.org/P36721 and previous config saved to /var/cache/conftool/dbconfig/20221027-101848-ladsgroup.json [10:18:53] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:51] 10SRE, 10ContentTranslation, 10Machine-Learning-Team, 10Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (10Pginer-WMF) [10:20:33] 10SRE, 10ContentTranslation, 10Machine-Learning-Team, 10Wikimedia Enterprise: Run NLLB-200 model in a new instance - https://phabricator.wikimedia.org/T321781 (10Pginer-WMF) [10:21:04] 10SRE, 10Infrastructure-Foundations: Setup an initial bookworm host with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 (10MoritzMuehlenhoff) [10:21:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1009.eqiad.wmnet [10:21:28] (03PS1) 10Muehlenhoff: debian: Add bookworm [puppet] - 10https://gerrit.wikimedia.org/r/850092 (https://phabricator.wikimedia.org/T321783) [10:22:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2026 es2027 es2028 for upgrade', diff saved to https://phabricator.wikimedia.org/P36722 and previous config saved to /var/cache/conftool/dbconfig/20221027-102209-root.json [10:23:55] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:25] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:25:15] (03PS1) 10Jbond: aptrepo: Add component pyall [puppet] - 10https://gerrit.wikimedia.org/r/850093 [10:25:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321123)', diff saved to https://phabricator.wikimedia.org/P36723 and previous config saved to /var/cache/conftool/dbconfig/20221027-102550-marostegui.json [10:25:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [10:25:56] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [10:26:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2125.codfw.wmnet with reason: Maintenance [10:26:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T321123)', diff saved to https://phabricator.wikimedia.org/P36724 and previous config saved to /var/cache/conftool/dbconfig/20221027-102611-marostegui.json [10:28:17] RECOVERY - Check systemd state on ms-be1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:28:31] (03CR) 10Awight: [C: 03+1] Enable show nearby feature on de.wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: 10WMDE-Fisch) [10:28:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321123)', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221027-102837-marostegui.json [10:28:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 1%: 
After upgrade', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221027-102843-root.json [10:28:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 1%: After upgrade', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221027-102847-root.json [10:28:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 1%: After upgrade', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20221027-102852-root.json [10:29:42] marostegui@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [10:29:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:30:12] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method= [10:30:18] (ProbeDown) firing: (3) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:19] (ProbeDown) firing: (12) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:30:35] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [10:30:35] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [10:30:54] Hm. [10:30:58] PROBLEM - restbase endpoints health on restbase2027 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get rev by title from storage returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:31:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P36725 and previous config saved to /var/cache/conftool/dbconfig/20221027-103110-ladsgroup.json [10:31:12] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [10:31:12] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [10:31:16] PROBLEM - PyBal backends health check on lvs3005 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_443: Servers cp3054.esams.wmnet, cp3062.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:31:16] PROBLEM - PyBal backends health check on lvs1017 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, 
cp1089.eqiad.wmnet are marked down but pooled: textlb_443: Servers cp1087.eqiad.wmnet, cp1075.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: testlb6_443: Servers cp1087.eqiad.wmnet, cp1089.eqiad.wmnet are marked down but pooled: textlb6_443: Servers cp1087.eqiad.wmnet, cp1075.eq [10:31:16] t, cp1089.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:31:28] PROBLEM - PyBal backends health check on lvs5003 is CRITICAL: PYBAL CRITICAL - CRITICAL - testlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5016.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled [10:31:28] 6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5007.eqsin.wmnet, cp5010.eqsin.wmnet are marked down but pooled: textlb6_443: Servers cp5009.eqsin.wmnet, cp5011.eqsin.wmnet, cp5008.eqsin.wmnet, cp5015.eqsin.wmnet, cp5012.eqsin.wmnet, cp5010.eqsin.wmnet, cp5007.eqsin.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:31:32] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is CRITICAL: 1 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:31:32] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:31:32] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.9677 gt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:31:32] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:31:33] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:31:34] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/description/{title} (Get description for test page) timed out before a response was received: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{ [10:31:34] Get summary for test page) timed out before a response was received: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:31:34] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:31:34] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - api-https_443: Servers mw1422.eqiad.wmnet, mw1426.eqiad.wmnet, mw1346.eqiad.wmnet, mw1380.eqiad.wmnet, mw1390.eqiad.wmnet, mw1447.eqiad.wmnet, mw1427.eqiad.wmnet, mw1361.eqiad.wmnet, mw1406.eqiad.wmnet, mw1374.eqiad.wmnet, mw1344.eqiad.wmnet, mw1428.eqiad.wmnet, mw1388.eqiad.wmnet, mw1358.eqiad.wmnet, 
mw1386.eqiad.wmnet, mw1348.eqiad.wmnet, mw1402.eq [10:31:34] t, mw1448.eqiad.wmnet, mw1381.eqiad.wmnet, mw1362.eqiad.wmnet, mw1340.eqiad.wmnet, mw1449.eqiad.wmnet, mw1343.eqiad.wmnet, mw1421.eqiad.wmnet, mw1347.eqiad.wmnet, mw1377.eqiad.wmnet, mw1345.eqiad.wmnet, mw1375.eqiad.wmnet, mw1314.eqiad.wmnet, mw1424.eqiad.wmnet, mw1412.eqiad.wmnet, mw1398.eqiad.wmnet, mw1408.eqiad.wmnet, mw1404.eqiad.wmnet, mw1376.eqiad.wmnet, mw1363.eqiad.wmnet, mw1357.eqiad.wmnet, mw1423.eqiad.wmnet, mw1317.eqiad.wmnet, [10:31:35] eqiad.wmnet, mw1316.eqiad.wmnet, mw1379.eqiad.wmnet, mw1396.eqiad.wmnet, mw1312.eqiad.wmnet, mw1394.eqiad.wmnet, mw1383.eqiad.wmnet, mw1400.eqiad.wmnet, mw1392.eqiad.wmnet, mw1443.eqiad https://wikitech.wikimedia.org/wiki/PyBal [10:31:40] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:31:42] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:31:42] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article title) timed out be [10:31:42] esponse was received https://wikitech.wikimedia.org/wiki/Wikifeeds [10:32:02] RECOVERY - restbase endpoints health on restbase2027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:32:06] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [10:32:06] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media-list/{title} (Get media list from test page) timed out before a response was received: /{domain}/v1/page/mobile-html/{title} (Get page content HTML for test page) timed out before a response was received: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/ [10:32:06] ps_%28service%29 [10:32:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T318605)', diff saved to https://phabricator.wikimedia.org/P36726 and previous config saved to /var/cache/conftool/dbconfig/20221027-103214-ladsgroup.json [10:32:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:32:18] PROBLEM - PHP7 rendering on mw2416 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:32:20] RECOVERY - PyBal backends health check on lvs3005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:32:20] RECOVERY - PyBal backends health check on lvs1017 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:32:20] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [10:32:30] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:32:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2129.codfw.wmnet with reason: Maintenance [10:32:30] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:32:32] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:32:32] RECOVERY - PyBal backends health check on lvs5003 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:32:32] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:32:32] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:32:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P36727 and previous config saved to /var/cache/conftool/dbconfig/20221027-103236-ladsgroup.json [10:32:40] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:32:40] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:32:40] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [10:32:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P36728 and previous config saved to /var/cache/conftool/dbconfig/20221027-103248-ladsgroup.json [10:32:57] (PHPFPMTooBusy) firing: (2) Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:33:00] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:33:10] RECOVERY - PHP7 rendering on mw2416 is OK: HTTP OK: HTTP/1.1 302 Found - 516 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [10:33:10] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:34:42] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at codfw on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:34:52] PROBLEM - very high load average likely xfs on ms-be1059 is CRITICAL: CRITICAL - load average: 117.09, 101.62, 59.00 https://wikitech.wikimedia.org/wiki/Swift [10:35:14] ^ ouch? 
:/ [10:35:18] (ProbeDown) resolved: (17) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:35:19] (ProbeDown) resolved: (17) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:35:20] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [10:35:35] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [10:35:35] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [10:35:48] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [10:35:50] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:35:58] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1059.eqiad.wmnet [10:37:59] (PHPFPMTooBusy) resolved: (2) Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at codfw #page - 
https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [10:39:26] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) is CRITICAL: Test Suggest a source title to use for translation returned the unexpected status 503 (expecting: 200): /v2/suggest [10:39:26] s/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [10:41:24] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:41:37] SRE, Beta-Cluster-Infrastructure, Beta-Cluster-reproducible: Thumbnails on beta cluster return 503 Service Unavailable - https://phabricator.wikimedia.org/T321654 (TheresNoTime) Just noting the things I've tried (unsuccessfully): - Purging the varnish cache for this file - Deleting some generated t... 
[10:41:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host stat1010.eqiad.wmnet [10:43:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P36729 and previous config saved to /var/cache/conftool/dbconfig/20221027-104348-marostegui.json [10:43:52] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert) [10:43:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36730 and previous config saved to /var/cache/conftool/dbconfig/20221027-104352-root.json [10:43:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36731 and previous config saved to /var/cache/conftool/dbconfig/20221027-104356-root.json [10:44:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36732 and previous config saved to /var/cache/conftool/dbconfig/20221027-104402-root.json [10:44:38] PROBLEM - Disk space on cp5007 is CRITICAL: DISK CRITICAL - free space: / 276 MB (3% inode=62%): /tmp 276 MB (3% inode=62%): /var/tmp 276 MB (3% inode=62%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [10:44:40] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [10:44:55] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786 (Clement_Goubert) Open→In progress p:Triage→High [10:45:07] SRE, MW-on-K8s, Traffic, serviceops, and 2 others: 
Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert) [10:45:27] (PS1) Clément Goubert: hieradata: Add usernames for mw on k8s services [puppet] - https://gerrit.wikimedia.org/r/850094 (https://phabricator.wikimedia.org/T321786) [10:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T318950)', diff saved to https://phabricator.wikimedia.org/P36733 and previous config saved to /var/cache/conftool/dbconfig/20221027-104617-ladsgroup.json [10:46:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [10:46:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [10:46:22] RECOVERY - MegaRAID on an-worker1083 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:46:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T318950)', diff saved to https://phabricator.wikimedia.org/P36734 and previous config saved to /var/cache/conftool/dbconfig/20221027-104627-ladsgroup.json [10:46:50] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [10:47:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P36735 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-104755-ladsgroup.json [10:48:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1010.eqiad.wmnet [10:48:35] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, but a few questions inline." [puppet] - https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: David Caro) [10:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:50:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T318950)', diff saved to https://phabricator.wikimedia.org/P36736 and previous config saved to /var/cache/conftool/dbconfig/20221027-105024-ladsgroup.json [10:50:48] (PS1) Clément Goubert: admin: add mw on kubernetes namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) [10:50:57] (CR) Arturo Borrero Gonzalez: [C: +1] p::ceph:mon: set permissions if mgr key parent dirs [puppet] - https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: David Caro) [10:52:07] (CR) CI reject: [V: -1] admin: add mw on kubernetes namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: Clément Goubert) [10:52:30] Puppet, SRE, Infrastructure-Foundations, Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (jcrespo) > we want to move "systemd unit failed" off Icinga and onto AM too This is higher level, and out of scope of this ticket, but I wonder if... 
[10:54:44] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:54:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:57:06] (CR) Svantje Lilienthal: [C: +1] Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch) [10:57:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:57:58] RECOVERY - very high load average likely xfs on ms-be1059 is OK: OK - load average: 36.40, 61.19, 78.66 https://wikitech.wikimedia.org/wiki/Swift [10:58:30] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:58:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P36737 and previous config saved to /var/cache/conftool/dbconfig/20221027-105855-marostegui.json [10:59:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36738 and previous config saved to /var/cache/conftool/dbconfig/20221027-105901-root.json [10:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36739 and previous config saved to /var/cache/conftool/dbconfig/20221027-105907-root.json [10:59:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36740 and previous config saved to /var/cache/conftool/dbconfig/20221027-105910-root.json [11:02:28] (CR) Thiemo Kreuz (WMDE): [C: +1] Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch) [11:03:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T318950)', diff saved to https://phabricator.wikimedia.org/P36742 and previous config saved to /var/cache/conftool/dbconfig/20221027-110301-ladsgroup.json [11:03:07] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:03:27] (PS6) Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang) [11:04:18] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Citoid [11:04:32] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:05:12] !log installing nodejs security updates on buster [11:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P36743 and previous config saved to /var/cache/conftool/dbconfig/20221027-110531-ladsgroup.json [11:06:16] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [11:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P36744 and previous config saved to /var/cache/conftool/dbconfig/20221027-110638-ladsgroup.json [11:06:44] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:09:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1021.eqiad.wmnet with reason: Maintenance [11:09:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1021.eqiad.wmnet with reason: Maintenance [11:09:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36745 and previous config saved to /var/cache/conftool/dbconfig/20221027-110920-ladsgroup.json [11:09:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maintenance [11:10:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2021.codfw.wmnet with reason: Maintenance [11:10:06] (PS1) Kosta Harlan: [labs] GrowthExperiments: Use 
d3.js with new impact module [mediawiki-config] - https://gerrit.wikimedia.org/r/850098 (https://phabricator.wikimedia.org/T318854) [11:10:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36746 and previous config saved to /var/cache/conftool/dbconfig/20221027-111009-ladsgroup.json [11:10:22] RECOVERY - Host netflow1002 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [11:10:54] PROBLEM - Check systemd state on netflow1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [11:11:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [11:11:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance [11:11:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance [11:11:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:11:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [11:12:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T318950)', diff saved to https://phabricator.wikimedia.org/P36747 and previous config saved to /var/cache/conftool/dbconfig/20221027-111204-ladsgroup.json [11:12:10] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:14:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321123)', diff saved to 
https://phabricator.wikimedia.org/P36748 and previous config saved to /var/cache/conftool/dbconfig/20221027-111401-marostegui.json [11:14:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [11:14:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36749 and previous config saved to /var/cache/conftool/dbconfig/20221027-111406-root.json [11:14:08] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [11:14:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36750 and previous config saved to /var/cache/conftool/dbconfig/20221027-111412-root.json [11:14:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T318950)', diff saved to https://phabricator.wikimedia.org/P36751 and previous config saved to /var/cache/conftool/dbconfig/20221027-111414-ladsgroup.json [11:14:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2126.codfw.wmnet with reason: Maintenance [11:14:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [11:14:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2095.codfw.wmnet with reason: Maintenance [11:14:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36752 and previous config saved to /var/cache/conftool/dbconfig/20221027-111422-root.json [11:14:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T321123)', diff saved to https://phabricator.wikimedia.org/P36753 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-111427-marostegui.json [11:15:04] !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs [11:15:06] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs [11:16:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321123)', diff saved to https://phabricator.wikimedia.org/P36754 and previous config saved to /var/cache/conftool/dbconfig/20221027-111653-marostegui.json [11:20:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P36755 and previous config saved to /var/cache/conftool/dbconfig/20221027-112037-ladsgroup.json [11:21:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P36756 and previous config saved to /var/cache/conftool/dbconfig/20221027-112144-ladsgroup.json [11:22:20] (PS3) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) [11:22:28] PROBLEM - Check whether ferm is active by checking the default input chain on netflow1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [11:23:16] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:23:43] (PS1) Stang: Define a default value for wgPageTriageMaxAge [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) [11:23:49] (CR) CI reject: [V: -1] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - https://gerrit.wikimedia.org/r/850105 (owner: L10n-bot) [11:24:02] TheresNoTime: ^ [11:24:09] thanks! [11:24:17] !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs [11:24:18] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs [11:25:05] (PS4) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) [11:26:25] (CR) Novem Linguae: "Isn't this already covered by https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/808424/ ? That patch has both a default value" [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: Stang) [11:26:28] (PS5) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) [11:26:35] (PS1) Giuseppe Lavagetto: sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110 [11:26:53] (CR) CI reject: [V: -1] sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110 (owner: Giuseppe Lavagetto) [11:27:35] (CR) Volans: [C: +1] "LOL, sorry missed that" [cookbooks] - https://gerrit.wikimedia.org/r/850110 (owner: Giuseppe Lavagetto) [11:27:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36757 and previous config saved to /var/cache/conftool/dbconfig/20221027-112740-ladsgroup.json [11:29:00] (PS2) Giuseppe Lavagetto: sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110 [11:29:06] PROBLEM - Host netflow1002 is 
DOWN: PING CRITICAL - Packet loss = 100% [11:29:09] (CR) Giuseppe Lavagetto: [C: +2] sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110 (owner: Giuseppe Lavagetto) [11:29:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36758 and previous config saved to /var/cache/conftool/dbconfig/20221027-112911-root.json [11:29:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36759 and previous config saved to /var/cache/conftool/dbconfig/20221027-112917-root.json [11:29:20] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:29:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36760 and previous config saved to /var/cache/conftool/dbconfig/20221027-112927-root.json [11:29:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P36761 and previous config saved to /var/cache/conftool/dbconfig/20221027-112927-ladsgroup.json [11:31:09] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/850092 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff) [11:31:36] (PS3) Majavah: admin: Add wmcs-roots to cloudgw nodes [puppet] - https://gerrit.wikimedia.org/r/745952 [11:32:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P36762 and previous config saved to /var/cache/conftool/dbconfig/20221027-113159-marostegui.json [11:32:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.changedisk for changing disk 
type of netflow1002.eqiad.wmnet to plain [11:35:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow1002.eqiad.wmnet to plain [11:35:03] (Merged) jenkins-bot: sre.loadbalancer.restart-pybal: fix typo [cookbooks] - https://gerrit.wikimedia.org/r/850110 (owner: Giuseppe Lavagetto) [11:35:17] (CR) Samtar: [C: +1] "lgtm!" [mediawiki-config] - https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: Stang) [11:35:26] !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs [11:35:27] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs [11:35:31] (CR) Arturo Borrero Gonzalez: [C: +2] admin: Add wmcs-roots to cloudgw nodes [puppet] - https://gerrit.wikimedia.org/r/745952 (owner: Majavah) [11:35:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T318950)', diff saved to https://phabricator.wikimedia.org/P36763 and previous config saved to /var/cache/conftool/dbconfig/20221027-113544-ladsgroup.json [11:35:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:35:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:35:50] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:35:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T318950)', diff saved to https://phabricator.wikimedia.org/P36764 and previous config saved to /var/cache/conftool/dbconfig/20221027-113554-ladsgroup.json [11:36:08] (PS4) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - 
https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [11:36:17] (CR) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role (5 comments) [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [11:36:42] (CR) CI reject: [V: -1] [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [11:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P36765 and previous config saved to /var/cache/conftool/dbconfig/20221027-113651-ladsgroup.json [11:38:03] !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs [11:38:53] (PS5) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [11:39:24] RECOVERY - PyBal backends health check on lvs4005 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:39:52] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs4005.ulsfo.wmnet} and A:lvs [11:40:38] (PS7) Samtar: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang) [11:40:57] (CR) CI reject: [V: -1] enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) (owner: Stang) [11:41:06] (PS6) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 
(https://phabricator.wikimedia.org/T318704) [11:41:40] (PS8) Stang: enwiki: Raise wgPageTriageMaxAge to indefinite [mediawiki-config] - https://gerrit.wikimedia.org/r/808424 (https://phabricator.wikimedia.org/T310974) [11:42:37] (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37802/console" [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [11:42:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021', diff saved to https://phabricator.wikimedia.org/P36767 and previous config saved to /var/cache/conftool/dbconfig/20221027-114246-ladsgroup.json [11:43:16] (PS7) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) [11:43:55] thx [11:44:03] SRE, Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (jcrespo) https://lists.wikimedia.org/ having issues again? I get a response, but it takes 47-48 seconds to return a 301. 
[11:44:03] * volans wrong chan [11:44:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36768 and previous config saved to /var/cache/conftool/dbconfig/20221027-114416-root.json [11:44:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36769 and previous config saved to /var/cache/conftool/dbconfig/20221027-114422-root.json [11:44:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36770 and previous config saved to /var/cache/conftool/dbconfig/20221027-114432-root.json [11:45:39] (PS1) Arturo Borrero Gonzalez: wikimediacloud.org: refresh cloudgw server [dns] - https://gerrit.wikimedia.org/r/850116 (https://phabricator.wikimedia.org/T318704) [11:47:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P36771 and previous config saved to /var/cache/conftool/dbconfig/20221027-114706-marostegui.json [11:47:07] SRE, Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (Ladsgroup) It has been slow, I haven't looked why, it can be either of these two: - Somehow db responses are slow (many junk users created? Many junk emails have... [11:49:57] SRE, SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (SLyngshede-WMF) p:Triage→Medium [11:51:22] SRE, Data-Engineering-Operations, Data-Engineering-Planning, Mail, Patch-For-Review: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (BTullis) OK, thanks all. I'll make that change to the exim aliases file: `analytics-alerts:... 
[11:51:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T318605)', diff saved to https://phabricator.wikimedia.org/P36772 and previous config saved to /var/cache/conftool/dbconfig/20221027-115157-ladsgroup.json [11:51:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [11:52:04] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [11:52:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [11:52:14] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, phab1004, releases1002, releases2002, relforge1003, relforge1004, wcqs2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_change [11:52:28] PROBLEM - MegaRAID on an-worker1083 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [11:53:10] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:54] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:56:05] (CR) WMDE-Fisch: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: 10WMDE-Fisch) [11:56:53] (03PS1) 10Filippo Giunchedi: Revert "prometheus: temp disable mgmt checks until hiera export script is fixed" [puppet] - 10https://gerrit.wikimedia.org/r/850075 [11:57:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021', diff saved to https://phabricator.wikimedia.org/P36773 and previous config saved to /var/cache/conftool/dbconfig/20221027-115753-ladsgroup.json [11:59:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36774 and previous config saved to /var/cache/conftool/dbconfig/20221027-115920-root.json [11:59:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36775 and previous config saved to /var/cache/conftool/dbconfig/20221027-115927-root.json [11:59:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36776 and previous config saved to /var/cache/conftool/dbconfig/20221027-115936-root.json [11:59:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T318950)', diff saved to https://phabricator.wikimedia.org/P36777 and previous config saved to /var/cache/conftool/dbconfig/20221027-115939-ladsgroup.json [11:59:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance [11:59:46] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [11:59:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance [11:59:56] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - 
degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:00:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T318950)', diff saved to https://phabricator.wikimedia.org/P36778 and previous config saved to /var/cache/conftool/dbconfig/20221027-120001-ladsgroup.json [12:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [12:00:29] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "prometheus: temp disable mgmt checks until hiera export script is fixed" [puppet] - 10https://gerrit.wikimedia.org/r/850075 (owner: 10Filippo Giunchedi) [12:00:34] (03PS2) 10Filippo Giunchedi: Revert "prometheus: temp disable mgmt checks until hiera export script is fixed" [puppet] - 10https://gerrit.wikimedia.org/r/850075 [12:00:57] !log oblivian@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on D{lvs200[7-8].codfw.wmnet} and A:lvs [12:01:53] !log oblivian@cumin1001 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on D{lvs200[7-8].codfw.wmnet} and A:lvs [12:02:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T318950)', diff saved to https://phabricator.wikimedia.org/P36780 and previous config saved to /var/cache/conftool/dbconfig/20221027-120211-ladsgroup.json [12:02:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:02:18] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (10jcrespo) Sample traffic (under NDA): {P36779} [12:02:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2138.codfw.wmnet with reason: Maintenance 
[12:02:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36781 and previous config saved to /var/cache/conftool/dbconfig/20221027-120234-marostegui.json [12:02:41] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:03:26] (03CR) 10David Caro: "Do you have to change https://gerrit.wikimedia.org/g/operations/puppet/+/1cb0f1e4cf777795474dae711f03ed167949c3d3/hieradata/codfw/profile/" [puppet] - 10https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: 10Arturo Borrero Gonzalez) [12:03:28] RECOVERY - MegaRAID on an-worker1083 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [12:04:12] (03CR) 10David Caro: cloudgw2003-dev: give proper role and take over cloudgw2001-dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: 10Arturo Borrero Gonzalez) [12:04:58] (03CR) 10David Caro: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/850116 (https://phabricator.wikimedia.org/T318704) (owner: 10Arturo Borrero Gonzalez) [12:05:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36782 and previous config saved to /var/cache/conftool/dbconfig/20221027-120500-marostegui.json [12:07:43] (03PS1) 10Ladsgroup: lists: Ban PetalBot from crawling [puppet] - 10https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) [12:07:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:08:54] (PS8) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) [12:09:14] (CR) Jcrespo: [C: +1] lists: Ban PetalBot from crawling (1 comment) [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) (owner: Ladsgroup) [12:11:10] (CR) Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev (1 comment) [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez) [12:11:20] (CR) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (4 comments) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn) [12:13:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36783 and previous config saved to /var/cache/conftool/dbconfig/20221027-121259-ladsgroup.json [12:13:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance [12:13:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1022.eqiad.wmnet with reason: Maintenance [12:13:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36784 and previous config saved to /var/cache/conftool/dbconfig/20221027-121323-ladsgroup.json [12:13:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [12:14:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36785 and previous config saved to /var/cache/conftool/dbconfig/20221027-121425-root.json [12:14:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2028 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36786 and previous config saved to /var/cache/conftool/dbconfig/20221027-121432-root.json [12:14:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P36787 and previous config saved to /var/cache/conftool/dbconfig/20221027-121441-root.json [12:15:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wikimediacloud.org: refresh cloudgw server [dns] - 10https://gerrit.wikimedia.org/r/850116 (https://phabricator.wikimedia.org/T318704) (owner: 10Arturo Borrero Gonzalez) [12:15:34] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:15:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36788 and previous config saved to /var/cache/conftool/dbconfig/20221027-121550-ladsgroup.json [12:16:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P36789 and previous config saved to /var/cache/conftool/dbconfig/20221027-121717-ladsgroup.json [12:18:01] (03PS2) 10Ladsgroup: lists: Ban PetalBot from crawling [puppet] - 
https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) [12:18:14] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.108 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [12:19:36] (PS6) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [12:20:00] (PS8) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) [12:20:02] (CR) Dzahn: doc: add parameters for gitlab_runner and contint hosts, allow them (2 comments) [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn) [12:20:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P36790 and previous config saved to /var/cache/conftool/dbconfig/20221027-122007-marostegui.json [12:20:26] (PS3) Ladsgroup: lists: Ban PetalBot from crawling [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) [12:20:30] (CR) Ladsgroup: [C: +2] lists: Ban PetalBot from crawling [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) (owner: Ladsgroup) [12:20:41] (CR) Ladsgroup: [V: +2 C: +2] lists: Ban PetalBot from crawling [puppet] - https://gerrit.wikimedia.org/r/850119 (https://phabricator.wikimedia.org/T321703) (owner: Ladsgroup) 
[12:21:25] (CR) Thiemo Kreuz (WMDE): [C: +1] [beta] Enable Kartographer show nearby clustering [mediawiki-config] - https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: WMDE-Fisch) [12:22:45] (CR) CI reject: [V: -1] doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn) [12:22:55] SRE, Wikimedia-Mailing-lists, Patch-For-Review: lists.wikimedia.org returns "Internal Server Error" for some pages - https://phabricator.wikimedia.org/T321703 (Ladsgroup) Open→Resolved a:Ladsgroup For longer term, I'd like to add a couple more cores to this poor tiny VM that has two cores... [12:23:32] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:23:32] (CR) Dzahn: [C: +1] "appledora changed the email in LDAP to the -ctr address. 
Let's get this merged" [puppet] - https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: Herron) [12:23:49] (PS7) Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [12:24:27] (CR) Klausman: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37803/console" [puppet] - https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: Klausman) [12:25:43] (PS2) Dzahn: admin: add appledora to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: Herron) [12:25:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T318950)', diff saved to https://phabricator.wikimedia.org/P36791 and previous config saved to /var/cache/conftool/dbconfig/20221027-122557-ladsgroup.json [12:26:04] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:26:28] (CR) Dzahn: [C: +1] "amended to update email address" [puppet] - https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: Herron) [12:27:23] (CR) Dzahn: "nope. 
after adding the type back and using alias it's still back to "parameter 'gitlab_runner_hosts' expects an Array value, got String "" [puppet] - https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: Dzahn) [12:28:34] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (20) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, clouddumps1001, clouddumps1002, phab1004, releases1002, releases2002, relforge1003, relforge1004, wcqs2003 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_change [12:28:42] (PS1) Ladsgroup: maintenance: Use $this->waitForReplication() [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) [12:28:59] jouncebot: nowandnext [12:28:59] No deployments scheduled for the next 0 hour(s) and 31 minute(s) [12:29:00] In 0 hour(s) and 31 minute(s): Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1300) [12:29:00] In 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1300) [12:29:18] (CR) Ladsgroup: [C: +2] maintenance: Use $this->waitForReplication() [core] (wmf/1.40.0-wmf.7) - https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) (owner: Ladsgroup) [12:29:28] (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/850105 (owner: 10L10n-bot) [12:30:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021', diff saved to https://phabricator.wikimedia.org/P36792 and previous config saved to /var/cache/conftool/dbconfig/20221027-123057-ladsgroup.json [12:32:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P36793 and previous config saved to /var/cache/conftool/dbconfig/20221027-123224-ladsgroup.json [12:35:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P36794 and previous config saved to /var/cache/conftool/dbconfig/20221027-123513-marostegui.json [12:41:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P36795 and previous config saved to /var/cache/conftool/dbconfig/20221027-124104-ladsgroup.json [12:42:02] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36796 and previous config saved to /var/cache/conftool/dbconfig/20221027-124255-ladsgroup.json [12:44:13] (03CR) 10CI reject: [V: 04-1] maintenance: Use $this->waitForReplication() [core] (wmf/1.40.0-wmf.7) - 10https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [12:44:14] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:45:33] (03Merged) 10jenkins-bot: maintenance: Use 
$this->waitForReplication() [core] (wmf/1.40.0-wmf.7) - 10https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [12:46:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021', diff saved to https://phabricator.wikimedia.org/P36797 and previous config saved to /var/cache/conftool/dbconfig/20221027-124603-ladsgroup.json [12:47:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T318950)', diff saved to https://phabricator.wikimedia.org/P36798 and previous config saved to /var/cache/conftool/dbconfig/20221027-124731-ladsgroup.json [12:47:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2130.codfw.wmnet with reason: Maintenance [12:47:39] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [12:47:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2130.codfw.wmnet with reason: Maintenance [12:47:48] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]] [12:47:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.7) - 10https://gerrit.wikimedia.org/r/850079 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [12:47:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36799 and previous config saved to /var/cache/conftool/dbconfig/20221027-124752-ladsgroup.json [12:47:56] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [12:48:09] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, 
mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [12:48:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:49:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:49:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:49:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36800 and previous config saved to /var/cache/conftool/dbconfig/20221027-125002-ladsgroup.json [12:50:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36801 and previous config saved to /var/cache/conftool/dbconfig/20221027-125020-marostegui.json [12:50:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [12:50:26] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [12:50:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2148.codfw.wmnet with reason: Maintenance [12:50:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T321123)', diff saved to https://phabricator.wikimedia.org/P36802 and previous config saved to /var/cache/conftool/dbconfig/20221027-125042-marostegui.json [12:52:16] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:52:29] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:850079|maintenance: Use $this->waitForReplication() (T298485)]] (duration: 04m 40s) [12:53:08] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321123)', diff saved to https://phabricator.wikimedia.org/P36803 and previous config saved to /var/cache/conftool/dbconfig/20221027-125307-marostegui.json [12:54:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:54:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [12:54:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [12:54:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [12:54:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P36804 and previous config saved to /var/cache/conftool/dbconfig/20221027-125456-ladsgroup.json [12:55:04] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [12:55:28] (03PS2) 10Vgutierrez: trafficserver: Clean up after ATS 9.x upgrade [puppet] - 10https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) [12:56:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P36805 and previous config saved to /var/cache/conftool/dbconfig/20221027-125610-ladsgroup.json [12:58:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P36806 and previous config saved to /var/cache/conftool/dbconfig/20221027-125801-ladsgroup.json [13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1300) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, 
and awight: Dear deployers, time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1300). [13:00:05] Sohom_Datta, WMDE-Fisch, and koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:19] * urbanecm waves [13:00:19] o/ [13:00:28] (03PS1) 10Muehlenhoff: Switch idp.w.o to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/850148 [13:00:32] o/ [13:00:34] Lucas_WMDE: will you deploy, or should i? [13:00:43] I’m fine either way :) [13:00:57] tbh i prefer someone else deploying today, currently in a middle of something else [13:01:01] ok, I can do it [13:01:02] (03CR) 10Muehlenhoff: [C: 03+2] debian: Add bookworm [puppet] - 10https://gerrit.wikimedia.org/r/850092 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [13:01:08] thanks [13:01:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2021 (T321312)', diff saved to https://phabricator.wikimedia.org/P36807 and previous config saved to /var/cache/conftool/dbconfig/20221027-130110-ladsgroup.json [13:01:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: Maintenance [13:01:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2022.codfw.wmnet with reason: Maintenance [13:01:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es2022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36808 and previous config saved to /var/cache/conftool/dbconfig/20221027-130135-ladsgroup.json [13:02:25] (03CR) 10JMeybohm: [C: 03+1] "🎉 thanks for doing this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [13:03:46] !log depool cp5007 [13:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:45] (03PS4) 10Lucas Werkmeister (WMDE): Enable source links on Translation ns on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849672 (https://phabricator.wikimedia.org/T53980) (owner: 10Bodhisattwa) [13:04:53] o/ [13:05:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849672 (https://phabricator.wikimedia.org/T53980) (owner: 10Bodhisattwa) [13:05:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P36809 and previous config saved to /var/cache/conftool/dbconfig/20221027-130509-ladsgroup.json [13:06:03] (03Merged) 10jenkins-bot: Enable source links on Translation ns on bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849672 (https://phabricator.wikimedia.org/T53980) (owner: 10Bodhisattwa) [13:06:19] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:849672|Enable source links on Translation ns on bnwikisource (T53980)]] [13:06:24] T53980: Source tab not showing up in the Translation namespace - https://phabricator.wikimedia.org/T53980 [13:06:38] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:06:38] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and bodhisattwa: Backport for [[gerrit:849672|Enable source links on Translation ns on bnwikisource (T53980)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:06:39] (03CR) 10Muehlenhoff: check_systemd_state: 
consume exclusion list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [13:07:23] Sohom_Datta: can you test the change on mwdebug? [13:07:55] Yep, I can see the changes on mwdebug, works fine :) [13:08:00] yay [13:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P36810 and previous config saved to /var/cache/conftool/dbconfig/20221027-130814-marostegui.json [13:08:16] RECOVERY - Disk space on cp5007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=cp5007&var-datasource=eqsin+prometheus/ops [13:09:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Enable show nearby feature on de.wikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: 10WMDE-Fisch) [13:09:56] Whoops [13:10:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:10:20] PROBLEM - Confd vcl based reload on cp5007 is CRITICAL: reload-vcl failed to run since 0h, 6 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [13:10:32] RECOVERY - Check systemd state on idp-test1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:11:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:11:16] PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:11:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T318950)', diff saved to https://phabricator.wikimedia.org/P36811 and previous config saved to /var/cache/conftool/dbconfig/20221027-131117-ladsgroup.json [13:11:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:11:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:11:23] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:11:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T318950)', diff saved to https://phabricator.wikimedia.org/P36812 and previous config saved to /var/cache/conftool/dbconfig/20221027-131127-ladsgroup.json [13:11:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:11:59] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:849672|Enable source links on Translation ns on bnwikisource (T53980)]] (duration: 05m 40s) [13:12:07] T53980: Source tab not showing up in the Translation namespace - https://phabricator.wikimedia.org/T53980 [13:12:09] !log pool cp5007 [13:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:12:24] RECOVERY - Confd vcl based reload on cp5007 is OK: reload-vcl successfully ran 0h, 0 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [13:13:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1022', diff saved to https://phabricator.wikimedia.org/P36813 and previous config saved to /var/cache/conftool/dbconfig/20221027-131308-ladsgroup.json [13:13:16] (PS4) WMDE-Fisch: Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) [13:14:06] skipping WMDE-Fisch for a second and continuing with koi [13:14:10] oh, nevermind [13:14:11] (CR) David Caro: [C: +1] "I actually use codesearch :)" [puppet] - https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: Arturo Borrero Gonzalez) [13:14:15] :D [13:14:16] Lucas_WMDE: Updated [13:14:19] :-) [13:14:45] (CR) Lucas Werkmeister (WMDE): [C: +1] Enable show nearby feature on de.wikivoyage (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch) [13:14:50] (PS5) Lucas Werkmeister (WMDE): Enable show nearby feature on de.wikivoyage [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch) [13:14:58] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: WMDE-Fisch) [13:15:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T318950)', diff saved to https://phabricator.wikimedia.org/P36814 and previous config saved to /var/cache/conftool/dbconfig/20221027-131524-ladsgroup.json [13:16:16] (Merged) jenkins-bot: Enable show nearby feature on de.wikivoyage [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/842353 (https://phabricator.wikimedia.org/T320692) (owner: 10WMDE-Fisch) [13:16:30] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:842353|Enable show nearby feature on de.wikivoyage (T320692)]] [13:16:36] T320692: Disable Wikivoyage nearby and enable Show Nearby on de.wikivoyage - https://phabricator.wikimedia.org/T320692 [13:16:49] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and wmde-fisch: Backport for [[gerrit:842353|Enable show nearby feature on de.wikivoyage (T320692)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:17:51] WMDE-Fisch: can you check that it’s working on mwdebug? [13:18:04] Lucas_WMDE: Tested on mwdebug, works fine. Please go on :-) [13:18:09] yay [13:18:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Matches the default in PageTriage’s extension.json." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [13:19:43] Hi Lucas_WMDE, TheresNoTime said this patch is not really testable [13:19:52] yeah, makes sense [13:19:58] and just need to " i.e. 
check https://grafana-rw.wikimedia.org/d/GDZR_4IVz/pagetriage-debugging?orgId=1&from=now-7d&to=now&refresh=1m and make sure the NOINDEX graph doesn't dramatically drop/rise" [13:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P36815 and previous config saved to /var/cache/conftool/dbconfig/20221027-132016-ladsgroup.json [13:20:58] I assume that’s something that would happen over the course of days rather than minutes [13:21:01] 10SRE, 10Traffic: Upgrade HAProxy on cp nodes to 2.6.x LTS - https://phabricator.wikimedia.org/T321775 (10CDanis) [13:21:02] (as articles slowly get re-parsed) [13:21:09] (03CR) 10Vgutierrez: "https://puppet-compiler.wmflabs.org/pcc-worker1003/37804/ errors for deployment-cache-text07 are expected. We need to update the hieradata" [puppet] - 10https://gerrit.wikimedia.org/r/850087 (https://phabricator.wikimedia.org/T321776) (owner: 10Vgutierrez) [13:21:14] (03PS2) 10Muehlenhoff: wmcs::nfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013) [13:21:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36816 and previous config saved to /var/cache/conftool/dbconfig/20221027-132148-ladsgroup.json [13:22:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:22:06] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:842353|Enable show nearby feature on de.wikivoyage (T320692)]] (duration: 05m 35s) [13:22:15] T320692: Disable Wikivoyage nearby and enable Show Nearby on de.wikivoyage - https://phabricator.wikimedia.org/T320692 [13:22:32] (03PS2) 10Lucas Werkmeister (WMDE): Define a default value for wgPageTriageMaxAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) 
[13:22:40] (good point ref. cache/re-parse Lucas_WMDE) [13:22:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [13:23:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:23:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:23:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P36817 and previous config saved to /var/cache/conftool/dbconfig/20221027-132320-marostegui.json [13:23:36] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:23:38] (03Merged) 10jenkins-bot: Define a default value for wgPageTriageMaxAge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850106 (https://phabricator.wikimedia.org/T310974) (owner: 10Stang) [13:23:50] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]] [13:23:57] T310974: Extend PageTriageMaxAge (noindex) for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974 [13:23:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:24:09] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [13:24:30] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:24:52] I purged two articles on mwdebug 
and they didn’t get noindex [13:25:07] I don’t know PageTriage enough to find pages that should have (and keep) noindex [13:25:23] (all should stay the same with that config variable) [13:25:26] I’ll just continue with the sync [13:25:32] ack, sounds good [13:26:37] tbh I’m not sure this change is actually needed – maybe it would’ve been enough to use 'default' => null, 'enwiki' => 0 in the other change? [13:26:49] (null should mean that the setting isn’t added at all on most wiki, leaving the extension default in place) [13:26:59] but maybe it’s better to have the default explicit [13:27:02] (03CR) 10Jbond: doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [13:27:06] (03PS9) 10Jbond: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [13:27:10] (03PS1) 10Jbond: O:docker_registry_ha::registry: move defaults to common section [puppet] - 10https://gerrit.wikimedia.org/r/850153 [13:27:31] Lucas_WMDE: hm, did think of that, but I personally wanted to suggest that we explicitly set it/make it known [13:27:40] ok :) [13:27:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P36818 and previous config saved to /var/cache/conftool/dbconfig/20221027-132743-ladsgroup.json [13:27:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37805/console" [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [13:27:53] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:28:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance es1022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36819 and previous config saved to /var/cache/conftool/dbconfig/20221027-132814-ladsgroup.json [13:28:55] (03CR) 10Jbond: [C: 03+1] doc: add parameters for gitlab_runner and contint hosts, allow them (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [13:29:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:29:24] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:850106|Define a default value for wgPageTriageMaxAge (T310974)]] (duration: 05m 33s) [13:29:30] T310974: Extend PageTriageMaxAge (noindex) for unpatrolled articles at enwiki - https://phabricator.wikimedia.org/T310974 [13:29:46] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:29:49] anything else to deploy? [13:30:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:02] (I see the “real” enwiki PageTriage change is scheduled for next Monday) [13:30:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37806/console" [puppet] - 10https://gerrit.wikimedia.org/r/850153 (owner: 10Jbond) [13:30:21] I have a labs patch but that just needs to be merged and then a git fetch to be nice I guess. 
[13:30:25] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/850118 [13:30:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Cmjohnson) The mgmt links are still not working, The DNS is correct but I am unable to ping the servers. [13:30:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P36820 and previous config saved to /var/cache/conftool/dbconfig/20221027-133031-ladsgroup.json [13:30:51] (03PS2) 10Lucas Werkmeister (WMDE): [beta] Enable Kartographer show nearby clustering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: 10WMDE-Fisch) [13:30:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:30:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: 10WMDE-Fisch) [13:31:13] ty ;-) [13:31:13] I put it into scap backport, IIUC it’ll decide to skip the sync on its own [13:31:29] (03CR) 10jenkins-bot: doc: add parameters for gitlab_runner and contint hosts, allow them [puppet] - 10https://gerrit.wikimedia.org/r/849699 (https://phabricator.wikimedia.org/T321629) (owner: 10Dzahn) [13:31:43] (03Merged) 10jenkins-bot: [beta] Enable Kartographer show nearby clustering [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850118 (https://phabricator.wikimedia.org/T321307) (owner: 10WMDE-Fisch) [13:32:16] yup, it’s done already [13:32:34] it didn’t even log anything [13:32:58] !log UTC afternoon backport+config window done [13:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:04] (03PS1) 10Filippo Giunchedi: smokeping: add ensure parameter, set 
to present [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) [13:33:06] (03PS1) 10Filippo Giunchedi: profile: absent smokeping [puppet] - 10https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860) [13:33:08] (03PS1) 10Filippo Giunchedi: smokeping: remove module and profile [puppet] - 10https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860) [13:33:10] (03PS1) 10Filippo Giunchedi: smokeping: remove ancillary data [puppet] - 10https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860) [13:33:38] (03CR) 10CI reject: [V: 04-1] smokeping: add ensure parameter, set to present [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [13:35:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T318950)', diff saved to https://phabricator.wikimedia.org/P36821 and previous config saved to /var/cache/conftool/dbconfig/20221027-133522-ladsgroup.json [13:35:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:35:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [13:35:29] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [13:35:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance [13:35:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance [13:35:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T318950)', diff saved to https://phabricator.wikimedia.org/P36822 and previous config saved to /var/cache/conftool/dbconfig/20221027-133551-ladsgroup.json [13:36:01] 
!log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:36:21] (03PS1) 10Ssingh: cp4042: update site.pp and related configs for cp (text) role [puppet] - 10https://gerrit.wikimedia.org/r/850158 (https://phabricator.wikimedia.org/T317244) [13:36:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022', diff saved to https://phabricator.wikimedia.org/P36823 and previous config saved to /var/cache/conftool/dbconfig/20221027-133654-ladsgroup.json [13:36:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:36:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:37:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:38:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318950)', diff saved to https://phabricator.wikimedia.org/P36824 and previous config saved to /var/cache/conftool/dbconfig/20221027-133801-ladsgroup.json [13:38:26] (03CR) 10Ssingh: [C: 03+2] cp4042: update site.pp and related configs for cp (text) role [puppet] - 10https://gerrit.wikimedia.org/r/850158 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [13:38:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321123)', diff saved to https://phabricator.wikimedia.org/P36825 and previous config saved to /var/cache/conftool/dbconfig/20221027-133827-marostegui.json [13:38:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:38:32] (03CR) 10Btullis: [C: 03+1] "LGTM, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/842763 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:38:33] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [13:38:42] (03PS2) 10Filippo Giunchedi: smokeping: add ensure parameter, set to present [puppet] - 10https://gerrit.wikimedia.org/r/850154 (https://phabricator.wikimedia.org/T169860) [13:38:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:38:44] (03PS2) 10Filippo Giunchedi: profile: absent smokeping [puppet] - 10https://gerrit.wikimedia.org/r/850155 (https://phabricator.wikimedia.org/T169860) [13:38:46] (03PS2) 10Filippo Giunchedi: smokeping: remove module and profile [puppet] - 10https://gerrit.wikimedia.org/r/850156 (https://phabricator.wikimedia.org/T169860) [13:38:48] (03PS2) 10Filippo Giunchedi: smokeping: remove ancillary data [puppet] - 10https://gerrit.wikimedia.org/r/850157 (https://phabricator.wikimedia.org/T169860) [13:38:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36826 and previous config saved to /var/cache/conftool/dbconfig/20221027-133848-marostegui.json [13:39:05] (03CR) 10Btullis: [C: 03+1] "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/832258 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:39:12] (03CR) 10Muehlenhoff: [C: 03+2] wmcs::nfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842758 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:39:23] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4042.ulsfo.wmnet with OS buster [13:39:52] (03CR) 10Btullis: [C: 03+1] matomo/piwik: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/832483 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:40:28] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:40:56] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36827 and previous config saved to /var/cache/conftool/dbconfig/20221027-134115-marostegui.json [13:42:38] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Duplicate monitoring for systemd::timer::job - https://phabricator.wikimedia.org/T303253 (10fgiunchedi) >>! In T303253#8348894, @jcrespo wrote: >> we want to move "systemd unit failed" off Icinga and onto AM too > > This is higher level, a... 
[13:42:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P36828 and previous config saved to /var/cache/conftool/dbconfig/20221027-134251-ladsgroup.json [13:43:00] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:46] (03CR) 10Muehlenhoff: [C: 03+2] matomo/piwik: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/832483 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:44:06] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:45:04] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [13:45:06] (03CR) 10Elukey: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [13:45:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P36829 and previous config saved to /var/cache/conftool/dbconfig/20221027-134537-ladsgroup.json [13:45:49] (03CR) 10Awight: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: 10Awight) [13:46:13] (03CR) 10Clément Goubert: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/850095 (https://phabricator.wikimedia.org/T321786) (owner: 10Clément Goubert) [13:46:13] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:46:48] (03CR) 10Muehlenhoff: [C: 03+2] Switch idp.w.o to idp2002 [dns] - 10https://gerrit.wikimedia.org/r/850148 (owner: 10Muehlenhoff) [13:50:41] (03CR) 10WMDE-Fisch: [C: 03+1] Invite some of WMDE Tech Wishes team to poke around maps instances [puppet] - 10https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: 10Awight) [13:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022', diff saved to https://phabricator.wikimedia.org/P36830 and previous config saved to /var/cache/conftool/dbconfig/20221027-135201-ladsgroup.json [13:52:16] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:53:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P36831 and previous config saved to /var/cache/conftool/dbconfig/20221027-135307-ladsgroup.json [13:53:40] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:55:35] (03CR) 10Herron: [C: 03+2] "Thanks for the review and update!" 
[puppet] - 10https://gerrit.wikimedia.org/r/845033 (https://phabricator.wikimedia.org/T321086) (owner: 10Herron) [13:55:44] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P36832 and previous config saved to /var/cache/conftool/dbconfig/20221027-135621-marostegui.json [13:57:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P36833 and previous config saved to /var/cache/conftool/dbconfig/20221027-135757-ladsgroup.json [13:58:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:59:55] (03CR) 10Clément Goubert: [C: 03+2] Adapts specs and tests to kubeconform only [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [14:00:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T318950)', diff saved to https://phabricator.wikimedia.org/P36834 and previous config saved to /var/cache/conftool/dbconfig/20221027-140043-ladsgroup.json [14:00:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:00:50] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:00:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:01:02] (03PS8) 10Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - 
10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [14:01:04] (03CR) 10Muehlenhoff: [C: 03+2] statistics : Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842763 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:01:22] (03CR) 10Klausman: [WIP] wikilabels: move Postgres DB to its own (non-wmcs) role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [14:01:41] (03PS1) 10Vgutierrez: Add enterprisewikimedia.com as a ncredir domain [dns] - 10https://gerrit.wikimedia.org/r/850167 (https://phabricator.wikimedia.org/T321804) [14:02:00] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37807/console" [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [14:02:27] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for nginx on archiva/proxy [puppet] - 10https://gerrit.wikimedia.org/r/832258 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:02:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10herron) 05Open→03Resolved a:03herron The requested access has been granted and will fully propagate within 30 minutes. Transitioning this to r... 
[14:02:55] (03PS2) 10Vgutierrez: Add wikimediaenteprise.com as a ncredir domain [dns] - 10https://gerrit.wikimedia.org/r/850167 (https://phabricator.wikimedia.org/T321804) [14:03:42] (03Merged) 10jenkins-bot: Adapts specs and tests to kubeconform only [deployment-charts] - 10https://gerrit.wikimedia.org/r/842819 (https://phabricator.wikimedia.org/T316348) (owner: 10Clément Goubert) [14:03:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:04:31] (03CR) 10Clément Goubert: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [14:05:02] PROBLEM - MegaRAID on an-worker1083 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:05:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [14:07:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es2022 (T321312)', diff saved to https://phabricator.wikimedia.org/P36835 and previous config saved to /var/cache/conftool/dbconfig/20221027-140708-ladsgroup.json [14:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P36836 and previous config saved to /var/cache/conftool/dbconfig/20221027-140814-ladsgroup.json [14:08:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4042.ulsfo.wmnet with reason: host reimage [14:08:59] (KubernetesAPILatency) firing: (15) High 
Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:10:30] (03PS2) 10Muehlenhoff: dumps::generation: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842760 (https://phabricator.wikimedia.org/T308013) [14:11:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P36837 and previous config saved to /var/cache/conftool/dbconfig/20221027-141128-marostegui.json [14:13:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T318605)', diff saved to https://phabricator.wikimedia.org/P36838 and previous config saved to /var/cache/conftool/dbconfig/20221027-141304-ladsgroup.json [14:13:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:13:10] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:13:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:13:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36839 and previous config saved to /var/cache/conftool/dbconfig/20221027-141326-ladsgroup.json [14:13:59] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:14:53] (03CR) 10Muehlenhoff: [C: 03+2] dumps::generation: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842760 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:16:04] (03PS2) 10Muehlenhoff: kafka: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/842759 (https://phabricator.wikimedia.org/T308013) [14:20:41] (03PS9) 10Arturo Borrero Gonzalez: cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) [14:22:29] (03CR) 10Muehlenhoff: [C: 03+2] kafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/842759 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [14:23:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw2003-dev: give proper role and take over cloudgw2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/838125 (https://phabricator.wikimedia.org/T318704) (owner: 10Arturo Borrero Gonzalez) [14:23:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T318950)', diff saved to https://phabricator.wikimedia.org/P36840 and previous config saved to /var/cache/conftool/dbconfig/20221027-142320-ladsgroup.json [14:23:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance [14:23:27] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:23:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance [14:23:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T318950)', diff saved to https://phabricator.wikimedia.org/P36841 and previous config saved to /var/cache/conftool/dbconfig/20221027-142342-ladsgroup.json [14:24:10] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:16] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Daimona) [14:24:58] !log aborrero@cumin2002 START - Cookbook 
sre.hosts.reimage for host cloudgw2001-dev.codfw.wmnet with OS bullseye [14:25:33] !log aborrero@cumin2002 START - Cookbook sre.hosts.reimage for host cloudgw2003-dev.codfw.wmnet with OS bullseye [14:25:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318950)', diff saved to https://phabricator.wikimedia.org/P36842 and previous config saved to /var/cache/conftool/dbconfig/20221027-142552-ladsgroup.json [14:26:27] (03PS1) 10Muehlenhoff: Add Daimona to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/850170 [14:26:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321123)', diff saved to https://phabricator.wikimedia.org/P36843 and previous config saved to /var/cache/conftool/dbconfig/20221027-142634-marostegui.json [14:26:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [14:26:41] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [14:26:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2175.codfw.wmnet with reason: Maintenance [14:26:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T321123)', diff saved to https://phabricator.wikimedia.org/P36844 and previous config saved to /var/cache/conftool/dbconfig/20221027-142656-marostegui.json [14:27:56] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:34] (03CR) 10Muehlenhoff: [C: 03+2] Add Daimona to CONTRIBUTORS [puppet] - 10https://gerrit.wikimedia.org/r/850170 (owner: 10Muehlenhoff) [14:28:46] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321123)', diff saved to https://phabricator.wikimedia.org/P36845 and previous config saved to /var/cache/conftool/dbconfig/20221027-143045-marostegui.json [14:30:48] (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch: introduce profile [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:30:53] (03PS11) 10Filippo Giunchedi: dispatch: introduce profile [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [14:31:50] (03PS1) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [14:31:52] (03PS1) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [14:31:54] (03PS1) 10Jbond: rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 [14:32:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for appledora - https://phabricator.wikimedia.org/T321086 (10Isaac) Thanks all as always for the quick and helpful support in granting access! 
[14:33:58] (03CR) 10CI reject: [V: 04-1] R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [14:34:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:12] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4042.ulsfo.wmnet with OS buster [14:34:42] (03CR) 10CI reject: [V: 04-1] R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 (owner: 10Jbond) [14:35:41] (03CR) 10CI reject: [V: 04-1] rsync::server::module: drop auto_ferm_ipv6 parameter [puppet] - 10https://gerrit.wikimedia.org/r/850173 (owner: 10Jbond) [14:35:45] RECOVERY - MegaRAID on an-worker1083 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:38:11] (03PS9) 10Klausman: wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) [14:39:25] (03CR) 10Filippo Giunchedi: [C: 03+2] alerting_host: include dispatch profile [puppet] - 10https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:39:27] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:32] (03PS5) 10Filippo Giunchedi: alerting_host: include dispatch profile [puppet] - 10https://gerrit.wikimedia.org/r/849021 (https://phabricator.wikimedia.org/T313229) [14:40:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2146', diff saved to https://phabricator.wikimedia.org/P36846 and previous config saved to /var/cache/conftool/dbconfig/20221027-144058-ladsgroup.json [14:41:20] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [14:41:43] (03CR) 10Ahmon Dancy: "Adding Jelto to make sure he's aware of what's going on in this area." [puppet] - 10https://gerrit.wikimedia.org/r/850153 (owner: 10Jbond) [14:43:25] (03PS1) 10Ssingh: cp4041: update site.pp and related configs for cp (text) role [puppet] - 10https://gerrit.wikimedia.org/r/850176 (https://phabricator.wikimedia.org/T317244) [14:45:04] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [14:45:13] (03CR) 10Ssingh: [C: 03+2] cp4041: update site.pp and related configs for cp (text) role [puppet] - 10https://gerrit.wikimedia.org/r/850176 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [14:45:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P36847 and previous config saved to /var/cache/conftool/dbconfig/20221027-144551-marostegui.json [14:46:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36848 and previous config saved to /var/cache/conftool/dbconfig/20221027-144602-ladsgroup.json [14:46:08] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [14:47:08] (03CR) 10Sergio Gimeno: [C: 03+1] [labs] GrowthExperiments: Use d3.js with new impact module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850098 (https://phabricator.wikimedia.org/T318854) (owner: 10Kosta Harlan) [14:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance 
elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:48:05] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:48:42] !log installing twitter-bootstrap4 security updates [14:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4041.ulsfo.wmnet with OS buster [14:50:08] !log aborrero@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage [14:50:11] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudgw2001-dev.codfw.wmnet with reason: host reimage [14:51:17] !log installing krb5 bugfix updates from Bullseye point release [14:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:39] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:53:54] (03PS1) 10Filippo Giunchedi: hieradata: add dispatch db_hostname [puppet] - 10https://gerrit.wikimedia.org/r/850178 (https://phabricator.wikimedia.org/T313229) [14:55:08] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: add dispatch db_hostname [puppet] - 10https://gerrit.wikimedia.org/r/850178 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:55:15] (03PS2) 10Filippo Giunchedi: hieradata: add dispatch db_hostname [puppet] - 10https://gerrit.wikimedia.org/r/850178 (https://phabricator.wikimedia.org/T313229) [14:55:20] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kubernetes: Rename 
mwdebug to mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [14:55:44] (03CR) 10Filippo Giunchedi: [V: 03+2] hieradata: add dispatch db_hostname [puppet] - 10https://gerrit.wikimedia.org/r/850178 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [14:56:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P36849 and previous config saved to /var/cache/conftool/dbconfig/20221027-145604-ladsgroup.json [14:57:37] PROBLEM - Check systemd state on wcqs2003 is CRITICAL: CRITICAL - degraded: The following units failed: wcqs-updater.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P36850 and previous config saved to /var/cache/conftool/dbconfig/20221027-150058-marostegui.json [15:01:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P36851 and previous config saved to /var/cache/conftool/dbconfig/20221027-150108-ladsgroup.json [15:03:57] PROBLEM - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:03:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Create new mw-debug deployment 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:05:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:05:41] RECOVERY - Checks that the local airflow scheduler for airflow @analytics is working properly on an-launcher1002 is OK: OK: /usr/bin/env AIRFLOW_HOME=/srv/airflow-analytics /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-launcher1002.eqiad.wmnet succeeded https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [15:06:29] jouncebot: nowandnext [15:06:29] No deployments scheduled for the next 0 hour(s) and 53 minute(s) [15:06:29] In 0 hour(s) and 53 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1600) [15:06:35] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [15:07:06] !log Switching k8s-experimental mwdebug service [15:07:10] PROBLEM - Check systemd state on ms-be1056 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:16] !log installing node-moment security updates [15:07:18] !log Pausing mwdebug k8s deployments [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:38] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Create new mw-debug deployment [deployment-charts] - 
10https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:07:44] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:04] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:09:18] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=citoid.svc.eqiad.wmnet, port=4003): Read timed out. (read timeout=15)): /?spec https://wikitech.wikimedia.org/wiki/Citoid [15:10:03] (ProbeDown) firing: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T318950)', diff saved to https://phabricator.wikimedia.org/P36852 and previous config saved to /var/cache/conftool/dbconfig/20221027-151111-ladsgroup.json [15:11:12] RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:11:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [15:11:18] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:11:19] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2001-dev.codfw.wmnet with OS bullseye [15:11:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [15:11:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T318950)', diff saved to https://phabricator.wikimedia.org/P36853 and previous config saved to /var/cache/conftool/dbconfig/20221027-151133-ladsgroup.json [15:11:54] (03Merged) 10jenkins-bot: mediawiki: Create new mw-debug deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/849501 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:12:32] !log Silence ProbeDown instance="mwdebug:4444" for 1h [15:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:36] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [15:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318950)', diff saved to https://phabricator.wikimedia.org/P36854 and previous config saved to /var/cache/conftool/dbconfig/20221027-151343-ladsgroup.json [15:13:59] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:14:36] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1056 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:15:03] (ProbeDown) resolved: (4) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:48] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage [15:16:05] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321123)', diff saved to https://phabricator.wikimedia.org/P36855 and previous config saved to /var/cache/conftool/dbconfig/20221027-151604-marostegui.json [15:16:10] T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis - https://phabricator.wikimedia.org/T321123 [15:16:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P36856 and previous config saved to /var/cache/conftool/dbconfig/20221027-151615-ladsgroup.json [15:17:03] (03PS2) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [15:17:17] (03PS3) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [15:18:13] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:18:39] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:18:50] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:19:08] (03CR) 10Clément Goubert: [C: 03+2] kubernetes: Rename mwdebug to mw-debug [puppet] - 10https://gerrit.wikimedia.org/r/849502 (https://phabricator.wikimedia.org/T321201) (owner: 10Clément Goubert) [15:19:12] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:19:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4041.ulsfo.wmnet with reason: host reimage [15:21:59] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) After modifying the alias I also needed to set the following option in mailman. 
{F35641648,width=80%} [15:22:01] !log Unpausing mwdebug k8s deployments [15:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:32] (03PS7) 10Ottomata: Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) [15:22:36] 10SRE, 10Data-Engineering-Operations, 10Data-Engineering-Planning, 10Mail: Change the analytics-alerts email alias to a mailman distribution list - https://phabricator.wikimedia.org/T315486 (10BTullis) 05Open→03Resolved [15:23:30] !log k8s-experimental mwdebug service switched to new deployment mw-debug [15:23:31] RECOVERY - Check systemd state on wcqs2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:42] !log Removed silence ProbeDown instance="mwdebug:4444" [15:23:47] (03CR) 10CI reject: [V: 04-1] Declare mediawiki.page_change stream in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849144 (https://phabricator.wikimedia.org/T311129) (owner: 10Ottomata) [15:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:45] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: data reload [15:26:12] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wcqs2002.codfw.wmnet with reason: data reload [15:26:17] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on wcqs2003.codfw.wmnet with reason: data reload [15:26:33] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on wcqs2003.codfw.wmnet with reason: data reload [15:26:34] (03PS1) 10Clément Goubert: mw-debug: Remove old mwdebug deployment [deployment-charts] - 
10https://gerrit.wikimedia.org/r/850184 (https://phabricator.wikimedia.org/T321201) [15:26:42] !log bking@cumin2002 START - Cookbook sre.hosts.remove-downtime for wcqs2002.codfw.wmnet [15:26:42] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wcqs2002.codfw.wmnet [15:28:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P36857 and previous config saved to /var/cache/conftool/dbconfig/20221027-152849-ladsgroup.json [15:31:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36858 and previous config saved to /var/cache/conftool/dbconfig/20221027-153121-ladsgroup.json [15:31:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:31:26] (03PS1) 10Btullis: Add a postgres user with our IPv6 network address [puppet] - 10https://gerrit.wikimedia.org/r/850185 (https://phabricator.wikimedia.org/T319440) [15:31:28] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:31:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:31:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36859 and previous config saved to /var/cache/conftool/dbconfig/20221027-153143-ladsgroup.json [15:32:36] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37808/console" [puppet] - 10https://gerrit.wikimedia.org/r/850185 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [15:32:51] (03PS1) 10Ssingh: cp4050: update site.pp and related configs for cp (upload) role [puppet] - 
10https://gerrit.wikimedia.org/r/850186 (https://phabricator.wikimedia.org/T317244) [15:34:03] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add a postgres user with our IPv6 network address [puppet] - 10https://gerrit.wikimedia.org/r/850185 (https://phabricator.wikimedia.org/T319440) (owner: 10Btullis) [15:34:14] PROBLEM - Disk space on alert1001 is CRITICAL: DISK CRITICAL - /run/docker/netns/default is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=alert1001&var-datasource=eqiad+prometheus/ops [15:34:44] RECOVERY - Check systemd state on ms-be1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:38:27] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/849928 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [15:40:36] (03CR) 10Jbond: [C: 03+1] "lgtm minor optional nit" [puppet] - 10https://gerrit.wikimedia.org/r/849929 (https://phabricator.wikimedia.org/T303253) (owner: 10Giuseppe Lavagetto) [15:42:28] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4041.ulsfo.wmnet with OS buster [15:43:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P36860 and previous config saved to /var/cache/conftool/dbconfig/20221027-154356-ladsgroup.json [15:44:11] (03CR) 10Ssingh: [C: 03+2] cp4050: update site.pp and related configs for cp (upload) role [puppet] - 10https://gerrit.wikimedia.org/r/850186 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [15:44:23] (03CR) 10Elukey: [C: 03+1] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [15:45:24] RECOVERY - Check whether ferm is active by checking the 
default input chain on ms-be1056 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:45:27] (03CR) 10Klausman: [C: 03+2] wikilabels: move Postgres DB to its own (non-wmcs) role [puppet] - 10https://gerrit.wikimedia.org/r/849095 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [15:45:28] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4050.ulsfo.wmnet with OS buster [15:46:57] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:47:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:53:06] (03CR) 10David Caro: [V: 03+1] p::toolforge:harbor: use distro docker for bullseye (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [15:53:31] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't configure anything on base dataplace interface [puppet] - 10https://gerrit.wikimedia.org/r/850190 (https://phabricator.wikimedia.org/T319184) [15:53:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:54:56] (03CR) 10David Caro: [C: 03+2] p::ceph:mon: set permissions if mgr key parent dirs [puppet] - 10https://gerrit.wikimedia.org/r/849032 (https://phabricator.wikimedia.org/T321514) (owner: 10David Caro) [15:55:31] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2003-dev.codfw.wmnet with OS bullseye [15:55:35] (03PS1) 10Klausman: wikilables: fix wrong path for Postgres tuning.conf [puppet] - 10https://gerrit.wikimedia.org/r/850191 [15:56:28] (03CR) 10Andrew Bogott: [C: 03+1] cloudgw: don't configure anything on base dataplace interface [puppet] - 10https://gerrit.wikimedia.org/r/850190 (https://phabricator.wikimedia.org/T319184) (owner: 
10Arturo Borrero Gonzalez) [15:56:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: don't configure anything on base dataplace interface [puppet] - 10https://gerrit.wikimedia.org/r/850190 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [15:57:01] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37809/console" [puppet] - 10https://gerrit.wikimedia.org/r/850191 (owner: 10Klausman) [15:57:51] (03CR) 10Klausman: wikilables: fix wrong path for Postgres tuning.conf [puppet] - 10https://gerrit.wikimedia.org/r/850191 (owner: 10Klausman) [15:58:20] (03PS2) 10Klausman: wikilabels: fix wrong path for Postgres tuning.conf [puppet] - 10https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) [15:59:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T318950)', diff saved to https://phabricator.wikimedia.org/P36861 and previous config saved to /var/cache/conftool/dbconfig/20221027-155902-ladsgroup.json [15:59:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:59:09] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:59:39] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[15:59:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2167.codfw.wmnet with reason: Maintenance [15:59:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36862 and previous config saved to /var/cache/conftool/dbconfig/20221027-155946-ladsgroup.json [15:59:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [16:00:04] jbond and rzl: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1600). [16:00:04] zabe: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [16:00:24] hey [16:00:40] o/ [16:00:53] 10SRE, 10FR-MW-Vagrant, 10Fundraising-Backlog, 10MediaWiki-Vagrant: Package XDebug 2.9 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (10Tgr) 05Open→03Invalid After {T271649} and the switch to PHP 7.4, Vagrant now uses XDebug 3. 
[16:00:55] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:00:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36863 and previous config saved to /var/cache/conftool/dbconfig/20221027-160056-ladsgroup.json [16:01:09] 10SRE, 10Infrastructure-Foundations, 10Packaging: Update php-xdebug to 2.7.2 in apt.wikimedia.org - https://phabricator.wikimedia.org/T263933 (10Tgr) 05Open→03Invalid After {T271649} and the switch to PHP 7.4, Vagrant now uses XDebug 3. [16:01:59] (03PS1) 10Ssingh: cp4051: update site.pp and related configs for cp (upload) role [puppet] - 10https://gerrit.wikimedia.org/r/850192 (https://phabricator.wikimedia.org/T317244) [16:02:01] (03PS1) 10Ssingh: cp4052: update site.pp and related configs for cp (upload) role [puppet] - 10https://gerrit.wikimedia.org/r/850193 (https://phabricator.wikimedia.org/T317244) [16:02:40] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37810/console" [puppet] - 10https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [16:03:01] (03PS2) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [16:03:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:04:21] PROBLEM - Check systemd state on ms-be2064 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:05:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318605)', diff saved to 
https://phabricator.wikimedia.org/P36864 and previous config saved to /var/cache/conftool/dbconfig/20221027-160511-ladsgroup.json [16:05:18] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:07:34] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh) [16:08:01] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh) [16:08:18] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: codfw1dev: don't hardcode interface names [puppet] - 10https://gerrit.wikimedia.org/r/850195 (https://phabricator.wikimedia.org/T319184) [16:08:59] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:11:05] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2064 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:11:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/850171 (owner: 10Jbond) [16:11:19] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [16:13:59] (KubernetesAPILatency) firing: (15) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:14:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet [16:15:10] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4050.ulsfo.wmnet with reason: host reimage [16:15:15] 10SRE, 
10MW-on-K8s, 10serviceops: Re-think how we separate traffic to mediawiki in clusters. - https://phabricator.wikimedia.org/T291918 (10Krinkle) [16:16:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P36865 and previous config saved to /var/cache/conftool/dbconfig/20221027-161602-ladsgroup.json [16:16:34] (03CR) 10Elukey: [C: 03+1] wikilabels: fix wrong path for Postgres tuning.conf [puppet] - 10https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [16:18:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:18:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet [16:20:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P36866 and previous config saved to /var/cache/conftool/dbconfig/20221027-162018-ladsgroup.json [16:20:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet [16:21:04] (03CR) 10Andrew Bogott: [C: 03+1] cloudgw: codfw1dev: don't hardcode interface names [puppet] - 10https://gerrit.wikimedia.org/r/850195 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [16:21:14] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/37811/" [puppet] - 10https://gerrit.wikimedia.org/r/850195 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [16:21:27] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw: codfw1dev: don't hardcode interface 
names [puppet] - 10https://gerrit.wikimedia.org/r/850195 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [16:23:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:38] (03CR) 10Subramanya Sastry: [C: 03+1] Disable wgParserEnableLegacyMediaDOM on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844073 (https://phabricator.wikimedia.org/T314318) (owner: 10Arlolra) [16:23:59] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:24:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet [16:25:10] RECOVERY - Check systemd state on ms-be2064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:27:21] !log dancy@deploy1002 Started scap: testing mw-debug [16:27:55] jbond, rzl: any of you around? 
[16:28:43] zabe: here sorry i missed the ping earlier let me take a look [16:28:46] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudcontrol2005-dev, clouddb2002-dev, cloudgw2003-dev - https://phabricator.wikimedia.org/T306854 (10aborrero) [16:29:01] no worries [16:29:11] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [16:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P36867 and previous config saved to /var/cache/conftool/dbconfig/20221027-163109-ladsgroup.json [16:33:19] !log dancy@deploy1002 scap failed: CalledProcessError Command 'sudo -u mwbuilder /usr/bin/make -C /srv/mwbuilder/release/make-container-image -f Makefile build-and-push-all-images http_proxy=http://webproxy.eqiad.wmnet:8080 https_proxy=http://webproxy.eqiad.wmnet:8080 GIT_BASE=https://gerrit.wikimedia.org/r/ MW_CONFIG_BRANCH=master workdir_volume=/srv/mediawiki-staging mv_image_name=docker-registry.discovery.wmnet/restric [16:33:19] ted/mediawiki-multiversion webserver_image_name=docker-registry.discovery.wmnet/restricted/mediawiki-webserver MV_BASE_PACKAGES= MV_EXTRA_CA_CERT=' returned non-zero exit status 2. 
(duration: 05m 58s) [16:34:33] zabe: im just going to ping in serviceops to get someone else to approve 843001 as im not too familiar with tha [16:34:56] sure [16:35:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P36868 and previous config saved to /var/cache/conftool/dbconfig/20221027-163524-ladsgroup.json [16:39:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4050.ulsfo.wmnet with OS buster [16:41:59] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2064 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [16:43:13] (03CR) 10Ssingh: [C: 03+2] cp4051: update site.pp and related configs for cp (upload) role [puppet] - 10https://gerrit.wikimedia.org/r/850192 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [16:45:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4051.ulsfo.wmnet with OS buster [16:46:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36869 and previous config saved to /var/cache/conftool/dbconfig/20221027-164615-ladsgroup.json [16:46:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:46:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [16:46:22] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [16:46:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36870 and previous config saved to /var/cache/conftool/dbconfig/20221027-164626-ladsgroup.json [16:47:31] !log dancy@deploy1002 Started
scap: testing mw-debug [16:47:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36871 and previous config saved to /var/cache/conftool/dbconfig/20221027-164735-ladsgroup.json [16:47:54] (03CR) 10Klausman: [V: 03+1 C: 03+2] wikilabels: fix wrong path for Postgres tuning.conf [puppet] - 10https://gerrit.wikimedia.org/r/850191 (https://phabricator.wikimedia.org/T307389) (owner: 10Klausman) [16:48:23] !log dancy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:48:23] !log dancy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:50:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:50:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T318605)', diff saved to https://phabricator.wikimedia.org/P36872 and previous config saved to /var/cache/conftool/dbconfig/20221027-165031-ladsgroup.json [16:50:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:50:37] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:50:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [16:50:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P36873 and previous config saved to /var/cache/conftool/dbconfig/20221027-165052-ladsgroup.json [16:51:48] (03CR) 10BBlack: varnish: Conditionally set WMF-Last-Access cookie (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [16:52:24] !log dancy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:52:25] (03CR) 10BCornwall: varnish: Conditionally set WMF-Last-Access cookie (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [16:52:38] !log dancy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:52:56] !log dancy@deploy1002 dancy: testing mw-debug synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [16:52:59] !log dancy@deploy1002 Sync cancelled. [16:53:59] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:54:03] (03CR) 10Andrew Bogott: "epic pcc run in progress: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37812/console" [puppet] - 10https://gerrit.wikimedia.org/r/849473 (owner: 10David Caro) [16:55:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:55:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:55:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:55:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:55:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:56:59] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P36874 and previous config saved to /var/cache/conftool/dbconfig/20221027-165659-ladsgroup.json [16:57:05] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:58:59] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:00:04] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T1700). [17:01:39] Nothing for me to deploy today. [17:02:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P36875 and previous config saved to /var/cache/conftool/dbconfig/20221027-170242-ladsgroup.json [17:03:59] (KubernetesAPILatency) firing: (13) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:11:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage [17:12:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P36876 and previous config saved to /var/cache/conftool/dbconfig/20221027-171205-ladsgroup.json [17:12:57] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10RobH) cp4040 and cp4048 had the DAC cable clicked in on the NIC, but not pressed in quite all the way. Reseated and the link lights came up immediately. 
[17:14:39] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4051.ulsfo.wmnet with reason: host reimage [17:17:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P36877 and previous config saved to /var/cache/conftool/dbconfig/20221027-171749-ladsgroup.json [17:27:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P36878 and previous config saved to /var/cache/conftool/dbconfig/20221027-172712-ladsgroup.json [17:32:31] (03PS10) 10Xcollazo: Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) [17:32:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36879 and previous config saved to /var/cache/conftool/dbconfig/20221027-173255-ladsgroup.json [17:32:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [17:33:02] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [17:33:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [17:33:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:33:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [17:33:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T318950)', diff saved to https://phabricator.wikimedia.org/P36880 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-173322-ladsgroup.json [17:33:59] (KubernetesAPILatency) firing: (10) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:35:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318950)', diff saved to https://phabricator.wikimedia.org/P36881 and previous config saved to /var/cache/conftool/dbconfig/20221027-173532-ladsgroup.json [17:35:38] (03PS4) 10Jbond: R:rsync::manifests::server::module: add type validation [puppet] - 10https://gerrit.wikimedia.org/r/850171 [17:35:40] (03PS3) 10Jbond: R:rsync::manifests::server::module: Strengthen types [puppet] - 10https://gerrit.wikimedia.org/r/850172 [17:35:42] (03PS1) 10Jbond: C:statistics::rsyncd: use the nobody user explicitly [puppet] - 10https://gerrit.wikimedia.org/r/850233 [17:37:39] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:37:40] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye [17:37:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye [17:37:49] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4002 [17:37:49] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4002 [17:37:55] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4006 [17:38:26] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4006 [17:38:57] (03PS2) 10Jbond: C:statistics::rsyncd: use 
the nobody user explicitly [puppet] - 10https://gerrit.wikimedia.org/r/850233 [17:39:15] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [17:39:18] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4051.ulsfo.wmnet with OS buster [17:41:47] (03CR) 10BBlack: [C: 04-1] varnish: Conditionally set WMF-Last-Access cookie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [17:42:11] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [17:42:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T318605)', diff saved to https://phabricator.wikimedia.org/P36882 and previous config saved to /var/cache/conftool/dbconfig/20221027-174219-ladsgroup.json [17:42:30] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:42:36] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:43:37] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [17:43:46] (03PS4) 10BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) [17:44:19] (03CR) 10BCornwall: varnish: Conditionally set WMF-Last-Access cookie (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [17:45:12] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:45:43] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [17:47:55] (03PS5) 10BCornwall: varnish: Conditionally set WMF-Last-Access cookie [puppet] - 10https://gerrit.wikimedia.org/r/849184 
(https://phabricator.wikimedia.org/T262996) [17:48:22] 10SRE, 10SRE-swift-storage, 10Data-Engineering-Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10bking) 05Open→03Resolved [17:48:33] 10SRE, 10SRE-swift-storage, 10Data-Engineering-Planning, 10Wikidata, and 4 others: wdqs space usage on thanos-swift - https://phabricator.wikimedia.org/T314835 (10bking) [17:50:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P36883 and previous config saved to /var/cache/conftool/dbconfig/20221027-175038-ladsgroup.json [17:52:00] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [17:53:10] (03CR) 10BBlack: [C: 03+1] "LGTM! This is probably functionally correct now. We should probably validate against existing VTC tests, and possibly define a new one if" [puppet] - 10https://gerrit.wikimedia.org/r/849184 (https://phabricator.wikimedia.org/T262996) (owner: 10BCornwall) [17:54:58] (03CR) 10Xcollazo: "File rename fixes puppet issue:" [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [17:57:53] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [17:59:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Cmjohnson) @Ottomata this is failing in the installer because of the raid configuration. I probably do not have it set correctly. Can you give... 
[18:00:32] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh) [18:01:19] (03CR) 10Jbond: [C: 03+2] apache: Drop ve.wikimedia.org rewrite [puppet] - 10https://gerrit.wikimedia.org/r/843569 (https://phabricator.wikimedia.org/T320890) (owner: 10Zabe) [18:02:08] 10SRE: service::docker with 'latest' version behaves poorly if the host runs out of disk space - https://phabricator.wikimedia.org/T321851 (10taavi) [18:02:39] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10RobH) [18:02:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [18:02:48] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:02:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [18:03:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36884 and previous config saved to /var/cache/conftool/dbconfig/20221027-180301-ladsgroup.json [18:03:07] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [18:03:46] (03PS1) 10Btullis: Update the spark and spark-operator images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) [18:04:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36885 and previous config saved to /var/cache/conftool/dbconfig/20221027-180408-ladsgroup.json [18:05:45] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P36886 and previous config saved to /var/cache/conftool/dbconfig/20221027-180545-ladsgroup.json [18:06:38] (03CR) 10Btullis: Update the spark and spark-operator images (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/850244 (https://phabricator.wikimedia.org/T318730) (owner: 10Btullis) [18:09:14] (03PS1) 10Ahmon Dancy: scap::master: Clone the scap repo from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) [18:11:14] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:12:44] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:15:41] (03PS2) 10Ahmon Dancy: beta::autoupdater: Remove more obsolete stuff after scap prep auto [puppet] - 10https://gerrit.wikimedia.org/r/753787 [18:16:10] (03PS2) 10Ahmon Dancy: scap::master: Clone the scap repo from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/850246 (https://phabricator.wikimedia.org/T321847) [18:16:12] (03PS1) 10Ahmon Dancy: git::clone: Append .git to clone url for gitlab source [puppet] - 10https://gerrit.wikimedia.org/r/850249 [18:19:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P36887 and previous config saved to /var/cache/conftool/dbconfig/20221027-181915-ladsgroup.json [18:20:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T318950)', diff saved to https://phabricator.wikimedia.org/P36888 and previous config saved to /var/cache/conftool/dbconfig/20221027-182051-ladsgroup.json [18:20:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2174.codfw.wmnet with reason: 
Maintenance [18:20:58] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [18:21:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [18:21:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T318950)', diff saved to https://phabricator.wikimedia.org/P36889 and previous config saved to /var/cache/conftool/dbconfig/20221027-182113-ladsgroup.json [18:23:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318950)', diff saved to https://phabricator.wikimedia.org/P36890 and previous config saved to /var/cache/conftool/dbconfig/20221027-182323-ladsgroup.json [18:24:14] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.03 ms [18:24:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Ottomata) What's the error you are getting? See https://phabricator.wikimedia.org/T314160#8166075 and below. In codfw, sda and sdb were mapped... 
[18:28:07] (03CR) 10Ottomata: [C: 03+1] C:statistics::rsyncd: use the nobody user explicitly [puppet] - 10https://gerrit.wikimedia.org/r/850233 (owner: 10Jbond) [18:34:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P36891 and previous config saved to /var/cache/conftool/dbconfig/20221027-183421-ladsgroup.json [18:38:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P36892 and previous config saved to /var/cache/conftool/dbconfig/20221027-183830-ladsgroup.json [18:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [18:49:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36893 and previous config saved to /var/cache/conftool/dbconfig/20221027-184928-ladsgroup.json [18:49:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [18:49:34] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [18:49:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [18:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36894 and previous config saved to /var/cache/conftool/dbconfig/20221027-184949-ladsgroup.json [18:50:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1105:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36895 and previous config saved to /var/cache/conftool/dbconfig/20221027-185057-ladsgroup.json [18:53:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P36896 and previous config saved to /var/cache/conftool/dbconfig/20221027-185336-ladsgroup.json [19:00:45] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:01:56] (03CR) 10Jbond: [C: 03+2] Add Apache configuration for vewikimedia [puppet] - 10https://gerrit.wikimedia.org/r/843001 (https://phabricator.wikimedia.org/T320890) (owner: 10Zabe) [19:06:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P36897 and previous config saved to /var/cache/conftool/dbconfig/20221027-190604-ladsgroup.json [19:08:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T318950)', diff saved to https://phabricator.wikimedia.org/P36898 and previous config saved to /var/cache/conftool/dbconfig/20221027-190843-ladsgroup.json [19:08:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [19:08:50] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [19:08:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [19:09:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T318950)', diff saved to https://phabricator.wikimedia.org/P36899 and previous config saved to /var/cache/conftool/dbconfig/20221027-190904-ladsgroup.json [19:11:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 
(T318950)', diff saved to https://phabricator.wikimedia.org/P36900 and previous config saved to /var/cache/conftool/dbconfig/20221027-191114-ladsgroup.json [19:14:27] 10SRE, 10SRE-Access-Requests: Requesting deployment group membership for mfossati - https://phabricator.wikimedia.org/T321772 (10thcipriani) Approved as keeper of `deployment` group (probably fine to remove from `restricted` as it's a subset) >>! In T321772#8348479, @mfossati wrote: > @thcipriani : not sure a... [19:21:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P36901 and previous config saved to /var/cache/conftool/dbconfig/20221027-192110-ladsgroup.json [19:26:09] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:26:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P36902 and previous config saved to /var/cache/conftool/dbconfig/20221027-192621-ladsgroup.json [19:26:50] (03PS3) 10BCornwall: prometheus: Add ats header/body size total metrics [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) [19:26:52] (03CR) 10BCornwall: prometheus: Add ats header/body size total metrics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/845688 (https://phabricator.wikimedia.org/T284304) (owner: 10BCornwall) [19:29:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Cmjohnson) @Ottomata yes, that is what's happening here [19:30:24] (03PS1) 10RobH: ganeti4006 [puppet] - 10https://gerrit.wikimedia.org/r/850260 (https://phabricator.wikimedia.org/T317247) [19:31:14] (03CR) 10RobH: [C: 03+2] ganeti4006 
[puppet] - 10https://gerrit.wikimedia.org/r/850260 (https://phabricator.wikimedia.org/T317247) (owner: 10RobH) [19:31:15] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10RobH) [19:36:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T318950)', diff saved to https://phabricator.wikimedia.org/P36903 and previous config saved to /var/cache/conftool/dbconfig/20221027-193617-ladsgroup.json [19:36:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [19:36:24] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [19:36:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [19:36:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:36:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:36:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T318950)', diff saved to https://phabricator.wikimedia.org/P36904 and previous config saved to /var/cache/conftool/dbconfig/20221027-193656-ladsgroup.json [19:38:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318950)', diff saved to https://phabricator.wikimedia.org/P36905 and previous config saved to /var/cache/conftool/dbconfig/20221027-193803-ladsgroup.json [19:41:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P36906 and previous config saved to 
/var/cache/conftool/dbconfig/20221027-194127-ladsgroup.json [19:41:44] (03CR) 10Andrew Bogott: openstack: make domain-aware (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [19:49:40] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [19:50:25] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [19:50:46] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [19:51:31] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [19:52:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Ottomata) K, looks like RobH was able to [[ https://phabricator.wikimedia.org/T314160#8166665 | fix it somehow ]]. [19:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P36907 and previous config saved to /var/cache/conftool/dbconfig/20221027-195310-ladsgroup.json [19:53:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install kafka-logging100[45] - https://phabricator.wikimedia.org/T313960 (10Cmjohnson) @Jclark-ctr can you look at kafka-logging1005 and make sure the network cable is connected and the right port. Sorry to bug you on this... [19:54:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10Cmjohnson) I added these to netbox but when I ran the dns script and home, nothing changed. 
[19:56:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T318950)', diff saved to https://phabricator.wikimedia.org/P36908 and previous config saved to /var/cache/conftool/dbconfig/20221027-195634-ladsgroup.json [19:56:40] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [19:59:30] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [19:59:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:05] brennen: (Dis)respected human, time to deploy UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221027T2000). Please do the needful. [20:00:05] danisztls and koi: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:10] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [20:00:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:01:13] o/ [20:02:39] sorry, my mistake, I have nothing to deploy [20:03:01] Hey danisztls, we're about to deploy your patch. 
:) [20:03:07] koi: cool, thanks for clarifying [20:03:29] new robot [20:03:31] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.224 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:04:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:06:44] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza) [20:07:43] (PS5) Stef Dunlap: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza) [20:07:56] (CR) TrainBranchBot: "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza) [20:08:09] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti4006.mgmt.ulsfo.wmnet with reboot policy FORCED [20:08:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P36909 and previous config saved to /var/cache/conftool/dbconfig/20221027-200817-ladsgroup.json [20:08:40] (Merged) jenkins-bot: Deploy Research Incentive survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/834048 (https://phabricator.wikimedia.org/T318333) (owner: DDesouza) [20:08:46] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4006.ulsfo.wmnet with OS buster [20:08:54] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS buster [20:08:54] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:834048|Deploy Research Incentive survey on enwiki (T318333)]] [20:08:59] T318333: Deploy Research Incentive Survey targeting Sub-Saharan Africa and Latin America readers - https://phabricator.wikimedia.org/T318333 [20:09:14] !log kindrobot@deploy1002 kindrobot and dani: Backport for [[gerrit:834048|Deploy Research Incentive survey on enwiki (T318333)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:10:05] danisztls: your changes are ready to check on debug [20:10:53] kindrobot: lgtm [20:13:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [20:14:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [20:14:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [20:15:26] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:834048|Deploy Research Incentive survey on enwiki (T318333)]] (duration: 06m 32s) [20:15:32] T318333: Deploy Research Incentive Survey targeting Sub-Saharan Africa and Latin America readers - https://phabricator.wikimedia.org/T318333 [20:15:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [20:16:08] danisztls: deployment finished [20:16:22] kindrobot: thanks [20:16:55] !log End of UTC late backport deployment window [20:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:46] (PS1) Stang: Add main page on non-English privatewiki to wgWhitelistRead [mediawiki-config] - https://gerrit.wikimedia.org/r/850266 (https://phabricator.wikimedia.org/T321796) [20:20:48] \o/ new backport deployers [20:21:04] :D [20:21:31] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - 
degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318950)', diff saved to https://phabricator.wikimedia.org/P36910 and previous config saved to /var/cache/conftool/dbconfig/20221027-202323-ladsgroup.json [20:23:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1107.eqiad.wmnet with reason: Maintenance [20:23:30] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [20:23:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1107.eqiad.wmnet with reason: Maintenance [20:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T318950)', diff saved to https://phabricator.wikimedia.org/P36911 and previous config saved to /var/cache/conftool/dbconfig/20221027-202345-ladsgroup.json [20:24:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T318950)', diff saved to https://phabricator.wikimedia.org/P36912 and previous config saved to /var/cache/conftool/dbconfig/20221027-202452-ladsgroup.json [20:29:04] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [20:32:32] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [20:35:36] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti4006.ulsfo.wmnet with OS buster [20:35:43] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - 
https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS buster executed with errors:... [20:36:11] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4006.ulsfo.wmnet with OS bullseye [20:36:19] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bullseye [20:39:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P36913 and previous config saved to /var/cache/conftool/dbconfig/20221027-203959-ladsgroup.json [20:42:47] (CR) Ssingh: [C: +2] cp4052: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850193 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh) [20:47:00] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for commonswiki (T300770) [20:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:06] T300770: Special:UnconnectedPages for main namespace is slow (ca. 
10 seconds) - https://phabricator.wikimedia.org/T300770 [20:47:42] !log Running extensions/Wikibase/client/maintenance/populateUnexpectedUnconnectedPagePageProp.php for enwiki, enwiktionary (T300770) [20:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:53] (PS2) Ssingh: cp4052: update site.pp and related configs for cp (upload) role [puppet] - https://gerrit.wikimedia.org/r/850193 (https://phabricator.wikimedia.org/T317244) [20:50:54] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [20:52:12] (PS1) Ssingh: cp4040: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850273 (https://phabricator.wikimedia.org/T317244) [20:53:31] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [20:55:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P36914 and previous config saved to /var/cache/conftool/dbconfig/20221027-205505-ladsgroup.json [20:56:07] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [20:56:08] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4006.ulsfo.wmnet with reason: host reimage [20:56:31] !log sudo ipmitool -I lanplus -H "cp4052.mgmt.ulsfo.wmnet" -U root -E chassis power cycle [20:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:32] ACKNOWLEDGEMENT - MD RAID on ganeti4006 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.38. 
Check system logs on 10.128.0.38 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T321863 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [20:57:37] SRE, ops-ulsfo: Degraded RAID on ganeti4006 - https://phabricator.wikimedia.org/T321863 (ops-monitoring-bot) [20:58:59] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:59:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:00:13] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [21:02:07] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2002 is OK: SSL OK - Certificate centrallog2002.codfw.wmnet valid until 2026-09-27 13:35:26 +0000 (expires in 1430 days) https://wikitech.wikimedia.org/wiki/Logs [21:02:32] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED [21:04:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:08:35] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:10:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1107 (T318950)', diff saved to https://phabricator.wikimedia.org/P36915 and previous config saved to /var/cache/conftool/dbconfig/20221027-211012-ladsgroup.json [21:10:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [21:10:18] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [21:10:27] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED [21:10:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1118.eqiad.wmnet with reason: Maintenance [21:10:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T318950)', diff saved to https://phabricator.wikimedia.org/P36916 and previous config saved to /var/cache/conftool/dbconfig/20221027-211034-ladsgroup.json [21:11:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318950)', diff saved to https://phabricator.wikimedia.org/P36917 and previous config saved to /var/cache/conftool/dbconfig/20221027-211142-ladsgroup.json [21:12:15] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED [21:13:22] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4006.ulsfo.wmnet with OS bullseye [21:13:28] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by robh@cumin2002 for host ganeti4006.ulsfo.wmnet with OS bullseye completed: - ganeti4... 
[21:13:59] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:14:10] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (RobH) [21:16:57] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:18:54] SRE, ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (RobH) When trying to run the sre.hosts.provision script on cp4052, I get the following issue: ` [1/30, retrying in 30.00s] Polling task: JID_669057087428 not co... [21:20:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:45] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp4052.mgmt.ulsfo.wmnet with reboot policy FORCED [21:21:09] PROBLEM - SSH on mw1325.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:26:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P36918 and previous config saved to /var/cache/conftool/dbconfig/20221027-212648-ladsgroup.json [21:28:59] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST certificates) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:34:21] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host 
cp4052.ulsfo.wmnet with OS buster [21:34:40] SRE, ops-ulsfo, Infrastructure-Foundations: Degraded RAID on ganeti4006 - https://phabricator.wikimedia.org/T321863 (Peachey88) [21:41:03] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [21:41:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:41:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P36919 and previous config saved to /var/cache/conftool/dbconfig/20221027-214154-ladsgroup.json [21:45:39] (CR) Andrew Bogott: "prod pcc run looks good:" [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro) [21:46:28] !log sukhe@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp4052 [21:46:31] !log sukhe@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp4052 [21:46:38] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [21:56:05] (PS1) BBlack: Add fake digicert-2022 keys [labs/private] - https://gerrit.wikimedia.org/r/850279 [21:56:14] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS buster [21:56:41] (PS1) Ssingh: Revert "cp4052: update site.pp and related configs for cp (upload) role" [puppet] - https://gerrit.wikimedia.org/r/850085 [21:57:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T318950)', diff saved to https://phabricator.wikimedia.org/P36920 and previous config saved to /var/cache/conftool/dbconfig/20221027-215701-ladsgroup.json [21:57:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [21:57:08] 
T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [21:57:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [21:57:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T318950)', diff saved to https://phabricator.wikimedia.org/P36921 and previous config saved to /var/cache/conftool/dbconfig/20221027-215723-ladsgroup.json [21:57:55] (CR) Ssingh: [C: +2] Revert "cp4052: update site.pp and related configs for cp (upload) role" [puppet] - https://gerrit.wikimedia.org/r/850085 (owner: Ssingh) [21:58:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T318950)', diff saved to https://phabricator.wikimedia.org/P36922 and previous config saved to /var/cache/conftool/dbconfig/20221027-215831-ladsgroup.json [22:09:11] (CR) BBlack: [V: +2 C: +2] Add fake digicert-2022 keys [labs/private] - https://gerrit.wikimedia.org/r/850279 (owner: BBlack) [22:09:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:13:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P36923 and previous config saved to /var/cache/conftool/dbconfig/20221027-221337-ladsgroup.json [22:13:59] (KubernetesAPILatency) firing: (11) High Kubernetes API latency (LIST customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:14:37] (PS1) BBlack: Add digicert-2022 unified cert files [puppet] - 
https://gerrit.wikimedia.org/r/850285 (https://phabricator.wikimedia.org/T313328) [22:14:39] (PS1) BBlack: Add digicert-2022 to available unified set [puppet] - https://gerrit.wikimedia.org/r/850286 (https://phabricator.wikimedia.org/T313328) [22:14:41] (PS1) BBlack: Switch drmrs, eqsin, esams to digicert-2022 [puppet] - https://gerrit.wikimedia.org/r/850287 (https://phabricator.wikimedia.org/T313328) [22:14:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [22:16:37] (CR) Andrew Bogott: "VM pcc results: https://puppet-compiler.wmflabs.org/pcc-worker1003/37824/" [puppet] - https://gerrit.wikimedia.org/r/849473 (owner: David Caro) [22:16:48] (CR) CI reject: [V: -1] Add digicert-2022 unified cert files [puppet] - https://gerrit.wikimedia.org/r/850285 (https://phabricator.wikimedia.org/T313328) (owner: BBlack) [22:17:40] ^ 22:15:07 The following are missing a SPDX licence header: [22:17:45] really? 
[22:18:47] (CR) BBlack: [V: +2 C: +2] Add digicert-2022 unified cert files [puppet] - https://gerrit.wikimedia.org/r/850285 (https://phabricator.wikimedia.org/T313328) (owner: BBlack) [22:19:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:24:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:28:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P36924 and previous config saved to /var/cache/conftool/dbconfig/20221027-222844-ladsgroup.json [22:28:59] (KubernetesAPILatency) firing: (12) High Kubernetes API latency (LIST certificates) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:37:21] SRE, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (Andrew) Open→Resolved [22:37:26] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Andrew) [22:43:21] (PS1) Andrew Bogott: Purge the last few references to labstore100[67] [puppet] - https://gerrit.wikimedia.org/r/850296 (https://phabricator.wikimedia.org/T319217) [22:43:52] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1119 (T318950)', diff saved to https://phabricator.wikimedia.org/P36925 and previous config saved to /var/cache/conftool/dbconfig/20221027-224350-ladsgroup.json [22:43:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [22:43:59] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [22:44:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1128.eqiad.wmnet with reason: Maintenance [22:44:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T318950)', diff saved to https://phabricator.wikimedia.org/P36926 and previous config saved to /var/cache/conftool/dbconfig/20221027-224413-ladsgroup.json [22:44:53] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts labstore1007 [22:47:16] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic1089-production-search-omega-eqiad is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [22:49:39] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [22:50:38] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:51:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:51:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labstore1007 [22:53:21] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts labstore1006 [22:53:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318950)', diff saved to 
https://phabricator.wikimedia.org/P36927 and previous config saved to /var/cache/conftool/dbconfig/20221027-225322-ladsgroup.json [22:53:31] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [22:57:32] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [23:00:34] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:00:35] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts labstore1006 [23:01:27] (CR) Andrew Bogott: [C: +2] Purge the last few references to labstore100[67] [puppet] - https://gerrit.wikimedia.org/r/850296 (https://phabricator.wikimedia.org/T319217) (owner: Andrew Bogott) [23:04:37] ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (Andrew) a:Andrew→Jclark-ctr [23:04:43] ops-eqiad, decommission-hardware, Patch-For-Review, cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (Andrew) btw, I believe each of these servers is attached to an external disk shelf -- those shelves should also be decom'd. 
[23:08:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P36928 and previous config saved to /var/cache/conftool/dbconfig/20221027-230828-ladsgroup.json [23:09:38] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:22:40] RECOVERY - SSH on mw1325.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:23:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P36929 and previous config saved to /var/cache/conftool/dbconfig/20221027-232335-ladsgroup.json [23:38:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T318950)', diff saved to https://phabricator.wikimedia.org/P36930 and previous config saved to /var/cache/conftool/dbconfig/20221027-233842-ladsgroup.json [23:38:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [23:38:49] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [23:38:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1132.eqiad.wmnet with reason: Maintenance [23:39:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T318950)', diff saved to https://phabricator.wikimedia.org/P36931 and previous config saved to /var/cache/conftool/dbconfig/20221027-233903-ladsgroup.json [23:41:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T318950)', diff saved to https://phabricator.wikimedia.org/P36932 and previous config saved to /var/cache/conftool/dbconfig/20221027-234111-ladsgroup.json [23:51:09] RECOVERY - SSH on 
mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:56:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P36933 and previous config saved to /var/cache/conftool/dbconfig/20221027-235618-ladsgroup.json [23:58:38] (PS2) Ssingh: cp4040: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850273 (https://phabricator.wikimedia.org/T317244) [23:59:40] (CR) Ssingh: [C: +2] cp4040: update site.pp and related configs for cp (text) role [puppet] - https://gerrit.wikimedia.org/r/850273 (https://phabricator.wikimedia.org/T317244) (owner: Ssingh)