[00:02:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:09:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P39625 and previous config saved to /var/cache/conftool/dbconfig/20221115-000935-marostegui.json [00:10:02] (PS1) Stang: fiwiktionary: Add rollbacker group [mediawiki-config] - https://gerrit.wikimedia.org/r/856705 (https://phabricator.wikimedia.org/T323063) [00:10:43] PROBLEM - SSH on mw1310.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:10:58] (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:11:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on arclamp2001.codfw.wmnet with reason: host reimage [00:12:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:13:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetdb2003.codfw.wmnet with OS bullseye [00:13:23] SRE, ops-codfw, DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host 
puppetdb2003.codfw.wmnet with OS bullseye completed: - puppetdb2003 (**PASS**) - Removed from Puppet... [00:13:51] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:14:43] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:14:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on arclamp2001.codfw.wmnet with reason: host reimage [00:17:45] (JobUnavailable) resolved: (5) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:19:07] (Abandoned) Stang: tnwiki: Add extendedconfirmed group [mediawiki-config] - https://gerrit.wikimedia.org/r/830861 (https://phabricator.wikimedia.org/T317276) (owner: Stang) [00:20:58] (KubernetesRsyslogDown) resolved: rsyslog on aux-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:21:45] (JobUnavailable) firing: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:24:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T321130)', diff saved to https://phabricator.wikimedia.org/P39626 and previous config saved to 
/var/cache/conftool/dbconfig/20221115-002441-marostegui.json [00:24:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2136.codfw.wmnet with reason: Maintenance [00:24:47] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [00:25:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2136.codfw.wmnet with reason: Maintenance [00:25:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2136 (T321130)', diff saved to https://phabricator.wikimedia.org/P39627 and previous config saved to /var/cache/conftool/dbconfig/20221115-002514-marostegui.json [00:26:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:28:30] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:28:58] RECOVERY - SSH on aux-k8s-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:29:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host arclamp2001.codfw.wmnet with OS bullseye [00:29:23] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp2001.codfw.wmnet with OS bullseye completed: - arclamp2001 (**PASS**)... 
[00:29:45] (JobUnavailable) firing: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:33:37] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS buster [00:34:12] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:34:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:34:43] (KubernetesCalicoDown) resolved: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:36:59] PROBLEM - SSH on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:37:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T321130)', diff saved to https://phabricator.wikimedia.org/P39628 and previous config saved to /var/cache/conftool/dbconfig/20221115-003732-marostegui.json [00:37:38] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [00:38:03] RECOVERY - SSH on aux-k8s-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:38:11] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [00:39:25] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [00:39:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:41:45] (JobUnavailable) firing: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:42:05] PROBLEM - SSH on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [00:42:21] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:45:28] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:46:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in aux-k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:48:58] (KubernetesAPILatency) firing: (14) High Kubernetes API latency (LIST cronjobs) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:52:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P39629 and previous config saved to /var/cache/conftool/dbconfig/20221115-005238-marostegui.json [00:52:45] (JobUnavailable) firing: Reduced availability for job k8s-node-cadvisor in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:53:58] (KubernetesAPILatency) resolved: (14) High Kubernetes API latency (LIST cronjobs) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:55:28] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:56:19] SRE, ops-codfw, DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (Papaul) [00:57:27] SRE, ops-codfw, DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (Papaul) Open→Resolved @MoritzMuehlenhoff @jbond this is complete [00:57:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in 
aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:02:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [01:05:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [01:05:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:07:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:07:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P39630 and previous config saved to /var/cache/conftool/dbconfig/20221115-010745-marostegui.json [01:09:47] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:10:35] RECOVERY - SSH on mw1310.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:10:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:12:43] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 
(Papaul) [01:12:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:13:05] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:13:37] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (Papaul) Open→Resolved This is complete [01:20:51] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:22:43] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:22:45] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:22:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T321130)', diff saved to https://phabricator.wikimedia.org/P39631 and previous config saved to /var/cache/conftool/dbconfig/20221115-012251-marostegui.json [01:22:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance [01:22:57] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:23:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2137.codfw.wmnet with reason: Maintenance 
[01:23:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39632 and previous config saved to /var/cache/conftool/dbconfig/20221115-012313-marostegui.json [01:27:45] (JobUnavailable) resolved: (3) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:28:04] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS buster [01:32:43] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:32:45] (JobUnavailable) firing: (3) Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:33:46] RECOVERY - SSH on aux-k8s-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:35:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39633 and previous config saved to /var/cache/conftool/dbconfig/20221115-013510-marostegui.json [01:35:16] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [01:37:16] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 149 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:37:45] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:37:48] RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:38:28] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [01:40:58] PROBLEM - SSH on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:42:45] (JobUnavailable) firing: (10) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:20] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:47:45] (JobUnavailable) firing: (13) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:50:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P39634 and previous config saved to 
/var/cache/conftool/dbconfig/20221115-015017-marostegui.json [01:50:52] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:52:43] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:52:45] (JobUnavailable) firing: (15) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:28] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:56:16] RECOVERY - SSH on aux-k8s-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:57:14] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [01:57:43] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:57:54] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:28] (KubernetesCalicoDown) resolved: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:01:18] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:02:45] (JobUnavailable) firing: (13) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:03:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST events) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:05:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P39635 and previous config saved to /var/cache/conftool/dbconfig/20221115-020523-marostegui.json [02:07:06] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:07:28] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:07:45] (JobUnavailable) firing: (12) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:58] (KubernetesAPILatency) resolved: (7) High Kubernetes API latency (GET clusterinformations) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:18:42] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [02:20:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39636 and previous config saved to /var/cache/conftool/dbconfig/20221115-022030-marostegui.json [02:20:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2138.codfw.wmnet with reason: Maintenance [02:20:37] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [02:20:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2138.codfw.wmnet with reason: Maintenance [02:20:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39637 and previous config saved to /var/cache/conftool/dbconfig/20221115-022052-marostegui.json [02:29:20] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:32:28] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:33:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39638 and previous config saved to /var/cache/conftool/dbconfig/20221115-023307-marostegui.json [02:33:12] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [02:36:26] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:38:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:42:24] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [02:48:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P39639 and previous config saved to /var/cache/conftool/dbconfig/20221115-024813-marostegui.json [03:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T0300) [03:02:08] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 
2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:02:54] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:03:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P39640 and previous config saved to /var/cache/conftool/dbconfig/20221115-030320-marostegui.json [03:07:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [03:07:44] (PS1) TrainBranchBot: Branch commit for wmf/1.40.0-wmf.10 [core] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/856495 (https://phabricator.wikimedia.org/T320515) [03:07:46] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.40.0-wmf.10 [core] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/856495 (https://phabricator.wikimedia.org/T320515) (owner: TrainBranchBot) [03:12:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [03:12:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [03:16:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [03:18:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39641 and previous config saved to /var/cache/conftool/dbconfig/20221115-031826-marostegui.json [03:18:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2139.codfw.wmnet with reason: Maintenance [03:18:32] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [03:18:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2139.codfw.wmnet with 
reason: Maintenance [03:21:02] RECOVERY - dump of matomo in eqiad on backupmon1001 is OK: Last dump for matomo at eqiad (db1108) taken on 2022-11-15 03:11:35 (260 MiB, -3.3 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [03:24:36] (Merged) jenkins-bot: Branch commit for wmf/1.40.0-wmf.10 [core] (wmf/1.40.0-wmf.10) - https://gerrit.wikimedia.org/r/856495 (https://phabricator.wikimedia.org/T320515) (owner: TrainBranchBot) [03:29:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2147.codfw.wmnet with reason: Maintenance [03:29:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2147.codfw.wmnet with reason: Maintenance [03:29:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T321130)', diff saved to https://phabricator.wikimedia.org/P39642 and previous config saved to /var/cache/conftool/dbconfig/20221115-032929-marostegui.json [03:29:34] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [03:29:58] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:32:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [03:32:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [03:32:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [03:33:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [03:38:40] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:38:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:41:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321130)', diff saved to https://phabricator.wikimedia.org/P39643 and previous config saved to /var/cache/conftool/dbconfig/20221115-034127-marostegui.json [03:41:33] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [03:42:36] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:56:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P39644 and previous config saved to /var/cache/conftool/dbconfig/20221115-035634-marostegui.json [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T0400) [04:01:17] (PS1) TrainBranchBot: testwikis wikis to 1.40.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/856721 (https://phabricator.wikimedia.org/T320515) [04:01:19] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.40.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/856721 (https://phabricator.wikimedia.org/T320515) (owner: TrainBranchBot) [04:02:03] (Merged) jenkins-bot: testwikis wikis to 1.40.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/856721 (https://phabricator.wikimedia.org/T320515) (owner: TrainBranchBot) [04:02:31] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.10 refs T320515 
[04:02:35] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [04:03:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [04:05:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:38] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:08:48] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:09:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [04:09:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [04:10:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [04:11:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P39645 and previous config saved to /var/cache/conftool/dbconfig/20221115-041140-marostegui.json [04:15:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [04:16:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [04:16:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [04:17:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [04:21:24] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring 
[04:21:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:26:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2147 (T321130)', diff saved to https://phabricator.wikimedia.org/P39646 and previous config saved to /var/cache/conftool/dbconfig/20221115-042647-marostegui.json [04:26:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2155.codfw.wmnet with reason: Maintenance [04:26:52] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [04:26:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:27:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2155.codfw.wmnet with reason: Maintenance [04:27:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [04:27:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [04:27:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2155 (T321130)', diff saved to https://phabricator.wikimedia.org/P39647 and previous config saved to /var/cache/conftool/dbconfig/20221115-042713-marostegui.json [04:27:18] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:28:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:33:20] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:37:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [04:38:45] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.10 refs T320515 (duration: 36m 14s) [04:38:49] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [04:40:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321130)', diff saved to https://phabricator.wikimedia.org/P39648 and previous config saved to /var/cache/conftool/dbconfig/20221115-044002-marostegui.json [04:40:07] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [04:40:42] PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:40:48] !log mwpresync@deploy1002 Pruned MediaWiki: 1.40.0-wmf.7 (duration: 02m 01s) [04:48:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [04:49:04] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:50:40] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [04:51:46] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64600/IPv4: Active - PyBal, AS64605/IPv4: 
Active - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Any [04:51:46] ps://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:51:56] PROBLEM - Host cp5004 is DOWN: PING CRITICAL - Packet loss = 100% [04:51:56] PROBLEM - Host cp5016 is DOWN: PING CRITICAL - Packet loss = 100% [04:52:16] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [04:52:16] PROBLEM - Host cp5010 is DOWN: PING CRITICAL - Packet loss = 100% [04:52:20] PROBLEM - Host ncredir5002 is DOWN: PING CRITICAL - Packet loss = 100% [04:52:44] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [04:52:44] PROBLEM - Host cp5008 is DOWN: PING CRITICAL - Packet loss = 100% [04:52:44] PROBLEM - Host cp5002 is DOWN: PING CRITICAL - Packet loss = 100% [04:52:51] (Emergency syslog message) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [04:52:52] ok, that's not good [04:52:54] PROBLEM - Host durum5001 is DOWN: PING CRITICAL - Packet loss = 100% [04:53:10] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:53:19] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:53:20] checking if it's an msw outage [04:53:27] seems likely more than that [04:53:28] PROBLEM - Host cp5014 is DOWN: PING CRITICAL - Packet loss = 100% [04:53:32] PROBLEM - Host lvs5002 is DOWN: PING CRITICAL - Packet loss = 100% [04:53:32] PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100% [04:53:32] PROBLEM - Host durum5002 is DOWN: PING CRITICAL - Packet loss = 100% [04:53:32] PROBLEM - Host cp5012 is DOWN: PING CRITICAL - Packet loss = 100% [04:53:32] PROBLEM - Host dns5002 is DOWN: PING CRITICAL - Packet loss = 100% [04:53:38] RECOVERY - Host ncredir5002 is UP: PING WARNING - Packet loss = 66%, RTA = 281.59 ms [04:53:40] RECOVERY - Host cp5014 is UP: PING OK - Packet loss = 0%, RTA = 307.26 ms [04:53:40] RECOVERY - Host cp5016 is UP: PING OK - Packet loss = 0%, RTA = 318.83 ms [04:53:40] RECOVERY - Host cp5008 is UP: PING OK - Packet loss = 0%, RTA = 244.87 ms [04:53:40] RECOVERY - Host cp5010 is UP: PING OK - Packet loss = 0%, RTA = 292.63 ms [04:53:40] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 243.99 ms [04:53:40] RECOVERY - Host cp5004 is UP: PING OK - Packet loss = 0%, RTA = 292.88 ms [04:53:40] RECOVERY - Host cp5002 is UP: PING OK - Packet loss = 0%, RTA = 309.55 ms [04:53:41] RECOVERY - Host cp5012 is UP: PING OK - Packet loss = 0%, RTA = 324.95 ms [04:53:42] RECOVERY - Host dns5002 is UP: PING OK - Packet loss = 0%, RTA = 243.95 ms [04:53:42] RECOVERY - Host ganeti5002 is UP: PING OK - Packet loss = 0%, RTA = 240.05 ms [04:53:42] RECOVERY - Host durum5001 is UP: PING OK - Packet loss = 0%, RTA = 323.56 ms [04:53:50] RECOVERY - Host durum5002 is UP: PING OK - Packet loss = 0%, RTA = 242.12 ms [04:54:26] RECOVERY - Host lvs5002 is UP: PING OK - Packet loss = 0%, RTA = 294.04 ms [04:54:26] (ProbeDown) firing: (3) Service text-https:443 has failed probes
(http_text-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:54:45] Ok, that was too fast a flap for anything but an msw or asw reboot [04:55:00] we have ongoing site work there so this wasn't expected but isn't unforeseen. [04:55:08] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [04:55:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P39649 and previous config saved to /var/cache/conftool/dbconfig/20221115-045508-marostegui.json [04:56:42] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [04:59:50] here, looks over -- if it happens again I'll depool per s.ukhe's email [04:59:53] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:00:03] (virtual-chassis crash) firing: Alert for device asw1-eqsin.mgmt.eqsin.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [05:00:41] (ProbeDown) resolved: (3) Service text-https:443 has failed probes (http_text-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:00:48] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:38] (Emergency syslog message) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from Emergency syslog message -
https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [05:06:03] (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:08:45] (virtual-chassis crash) resolved: Device asw1-eqsin.mgmt.eqsin.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [05:10:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P39650 and previous config saved to /var/cache/conftool/dbconfig/20221115-051015-marostegui.json [05:10:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:10:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:10:58] (KubernetesRsyslogDown) resolved: rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:13:20] !log ~5AM UTC when plugging a new host into asw1-ulsfo, the virtual chassis crashed and rebooted, causing loss of connectivity to hosts for a very short period [05:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:04] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:14:10] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes 
in 0.228 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:22:20] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [05:23:42] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:25:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T321130)', diff saved to https://phabricator.wikimedia.org/P39651 and previous config saved to /var/cache/conftool/dbconfig/20221115-052521-marostegui.json [05:25:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2172.codfw.wmnet with reason: Maintenance [05:25:27] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [05:25:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2172.codfw.wmnet with reason: Maintenance [05:25:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2172 (T321130)', diff saved to https://phabricator.wikimedia.org/P39652 and previous config saved to /var/cache/conftool/dbconfig/20221115-052543-marostegui.json [05:26:48] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:32:48] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:37:28] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:38:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321130)', diff saved to 
https://phabricator.wikimedia.org/P39653 and previous config saved to /var/cache/conftool/dbconfig/20221115-053819-marostegui.json [05:38:24] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [05:38:34] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:43:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:53:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P39654 and previous config saved to /var/cache/conftool/dbconfig/20221115-055326-marostegui.json [06:05:26] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1005.eqiad.wmnet [06:05:40] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host stat1005.eqiad.wmnet [06:06:31] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1005.eqiad.wmnet [06:08:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P39655 and previous config saved to /var/cache/conftool/dbconfig/20221115-060832-marostegui.json [06:09:02] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host karapace1001.eqiad.wmnet [06:13:15] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host karapace1001.eqiad.wmnet [06:15:34] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet [06:18:13] 10SRE, 10ops-ulsfo, 10DC-Ops, 
10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [06:18:14] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1005.eqiad.wmnet [06:18:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [06:19:27] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet [06:20:05] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1008.eqiad.wmnet [06:23:39] !log robh@cumin1001 START - Cookbook sre.dns.netbox [06:23:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321130)', diff saved to https://phabricator.wikimedia.org/P39656 and previous config saved to /var/cache/conftool/dbconfig/20221115-062339-marostegui.json [06:23:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2179.codfw.wmnet with reason: Maintenance [06:23:46] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [06:23:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2179.codfw.wmnet with reason: Maintenance [06:24:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T321130)', diff saved to https://phabricator.wikimedia.org/P39657 and previous config saved to /var/cache/conftool/dbconfig/20221115-062400-marostegui.json [06:25:37] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:26:02] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cp5032 [06:26:28] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5032 [06:26:52] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 
(10RobH) [06:30:20] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1008.eqiad.wmnet [06:36:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321130)', diff saved to https://phabricator.wikimedia.org/P39658 and previous config saved to /var/cache/conftool/dbconfig/20221115-063642-marostegui.json [06:36:48] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [06:38:58] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [06:42:32] RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:43:04] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:43:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:43:16] !log all in rack work in eqsin is complete for today [06:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:46:56] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:51:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P39659 and previous config saved to /var/cache/conftool/dbconfig/20221115-065149-marostegui.json [07:00:05] kormat, marostegui, and Amir1: gettimeofday() says it's time for Primary database 
switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T0700) [07:05:34] PROBLEM - SSH on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:05:45] (JobUnavailable) firing: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:06:02] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:06:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P39660 and previous config saved to /var/cache/conftool/dbconfig/20221115-070655-marostegui.json [07:10:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:11:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:15:00] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:20:52] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:22:02] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321130)', diff saved to https://phabricator.wikimedia.org/P39661 and previous config saved to /var/cache/conftool/dbconfig/20221115-072202-marostegui.json [07:22:07] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [07:31:04] RECOVERY - SSH on aux-k8s-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:31:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:37:43] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856616 (owner: 10Muehlenhoff) [07:38:49] (03PS2) 10Muehlenhoff: Remove leftover entry for bast4002 [puppet] - 10https://gerrit.wikimedia.org/r/856522 [07:41:24] (03CR) 10Muehlenhoff: [C: 03+2] Remove leftover entry for bast4002 [puppet] - 10https://gerrit.wikimedia.org/r/856522 (owner: 10Muehlenhoff) [07:41:39] (03CR) 10Muehlenhoff: [C: 03+2] Remove puppet leftovers [puppet] - 10https://gerrit.wikimedia.org/r/856525 (https://phabricator.wikimedia.org/T306840) (owner: 10Muehlenhoff) [07:41:47] (03PS2) 10Muehlenhoff: Remove puppet leftovers [puppet] - 10https://gerrit.wikimedia.org/r/856525 (https://phabricator.wikimedia.org/T306840) [07:43:43] (03PS2) 10Muehlenhoff: Remove leftover Puppet entry [puppet] - 10https://gerrit.wikimedia.org/r/856524 (https://phabricator.wikimedia.org/T292075) [07:48:11] (03CR) 10Muehlenhoff: [C: 03+2] Remove leftover Puppet entry [puppet] - 10https://gerrit.wikimedia.org/r/856524 (https://phabricator.wikimedia.org/T292075) (owner: 
10Muehlenhoff) [07:53:40] (03PS4) 10David Caro: wmcs.toolforge.grid: get also the job logs [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841930 [07:57:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1021.eqiad.wmnet [08:00:04] Amir1 and Urbanecm: Your horoscope predicts another unfortunate UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:15] (03CR) 10David Caro: [C: 03+1] ceph: osd: factorize config read (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856674 (owner: 10Arturo Borrero Gonzalez) [08:05:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1021.eqiad.wmnet [08:05:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1021.eqiad.wmnet to cluster eqiad and group D [08:06:52] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:06:58] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1021.eqiad.wmnet to cluster eqiad and group D [08:22:04] 10ops-eqsin, 10Infrastructure-Foundations, 10netops: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10ayounsi) p:05Triage→03High [08:22:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:24:46] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10MoritzMuehlenhoff) [08:25:40] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:27:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:28:02] (03CR) 10Hashar: [C: 03+2] Gerrit v3.5.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [08:28:33] (03Merged) 10jenkins-bot: Gerrit v3.5.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [08:28:35] 10ops-eqsin, 10Infrastructure-Foundations, 10netops: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10ayounsi) @RobH I couldn't find any open task so I opened this one, in the future please make sure a task is opened as soon as any issue happens. Can you also let us know the im... [08:30:54] (03CR) 10David Caro: "Have you given a thought on how to test the single nick setup on osds? 
(performance, node recovery, etc.)" [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [08:43:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2140.codfw.wmnet with reason: Maintenance [08:43:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2140.codfw.wmnet with reason: Maintenance [08:44:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2114.codfw.wmnet with reason: Maintenance [08:44:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2114.codfw.wmnet with reason: Maintenance [08:44:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1173.eqiad.wmnet with reason: Maintenance [08:44:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1173.eqiad.wmnet with reason: Maintenance [08:45:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:46:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance [08:46:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:46:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [08:46:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39662 and previous config saved to /var/cache/conftool/dbconfig/20221115-084637-marostegui.json [08:46:42] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [08:47:05] 10SRE, 10ops-eqsin, 
10Infrastructure-Foundations, 10netops: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10ayounsi) JTAC case 2022-1115-586910 opened. [08:47:31] (03CR) 10David Caro: "Mostly questions" [puppet] - 10https://gerrit.wikimedia.org/r/856544 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [08:49:52] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39663 and previous config saved to /var/cache/conftool/dbconfig/20221115-084959-marostegui.json [08:52:35] (03PS3) 10JMeybohm: k8s: Stop docker/runc spam from being written to syslog [puppet] - 10https://gerrit.wikimedia.org/r/855969 (https://phabricator.wikimedia.org/T307943) [08:52:37] (03PS3) 10JMeybohm: k8s: make profile::kubernetes::cluster_cidr mandatory [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) [08:52:39] (03PS3) 10JMeybohm: k8s: Refactor profile::kubernetes::master::service_cluster_ip_range [puppet] - 10https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) [08:52:41] (03PS2) 10JMeybohm: k8s: Add a central ipv6dualstack flag to enable dual stack [puppet] - 10https://gerrit.wikimedia.org/r/856589 (https://phabricator.wikimedia.org/T307943) [08:53:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1138.eqiad.wmnet with reason: Maintenance [08:53:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1138.eqiad.wmnet with reason: Maintenance [08:57:35] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-jijiki: Followups for Tegola and Swift interactions - 
https://phabricator.wikimedia.org/T307184 (10jijiki) [08:58:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:00:02] (03PS1) 10Majavah: P:wmcs::metricsinfra: use `pip install -e .` [puppet] - 10https://gerrit.wikimedia.org/r/856916 [09:00:04] (03PS1) 10Majavah: P:wmcs::metricsinfra: add timer to sync project list [puppet] - 10https://gerrit.wikimedia.org/r/856917 [09:00:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2162', diff saved to https://phabricator.wikimedia.org/P39664 and previous config saved to /var/cache/conftool/dbconfig/20221115-090058-root.json [09:02:52] (03CR) 10JMeybohm: [C: 03+2] k8s: Stop docker/runc spam from being written to syslog [puppet] - 10https://gerrit.wikimedia.org/r/855969 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:02:58] (03PS1) 10Marostegui: db2162: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856919 (https://phabricator.wikimedia.org/T323040) [09:03:03] (03CR) 10JMeybohm: [C: 03+2] k8s: make profile::kubernetes::cluster_cidr mandatory [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:03:08] (03CR) 10JMeybohm: [C: 03+2] k8s: Refactor profile::kubernetes::master::service_cluster_ip_range [puppet] - 10https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [09:03:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:04:18] (03CR) 10Marostegui: [C: 03+2] db2162: 
Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856919 (https://phabricator.wikimedia.org/T323040) (owner: 10Marostegui) [09:05:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P39666 and previous config saved to /var/cache/conftool/dbconfig/20221115-090505-marostegui.json [09:05:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT leases) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:06:03] (03PS1) 10Muehlenhoff: Remove bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/856920 (https://phabricator.wikimedia.org/T323092) [09:07:07] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ayounsi) @ssingh We have those 2 active alerts: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr3-ulsfo&service=BGP+status https:... [09:07:46] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts bast5001.wikimedia.org [09:10:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT leases) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:11:53] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10Volans) I would also like to understand why eqsin was not immediately depooled when it happened. The patch was already ready on Gerrit. There was [[ https://grafana.wi... 
[09:12:42] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:14:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:14:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts bast5001.wikimedia.org [09:14:48] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:15:43] (03CR) 10Muehlenhoff: [C: 03+2] Remove bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/856920 (https://phabricator.wikimedia.org/T323092) (owner: 10Muehlenhoff) [09:20:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P39667 and previous config saved to /var/cache/conftool/dbconfig/20221115-092011-marostegui.json [09:21:56] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:22:45] (JobUnavailable) firing: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:23:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install ganeti103[34] - https://phabricator.wikimedia.org/T314303 (10MoritzMuehlenhoff) ganeti1033/1034 have been added to the eqiad cluster (group D) [09:23:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:25:51] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "Ping me to merge this when you are ready at the 
keyboard." [puppet] - 10https://gerrit.wikimedia.org/r/856917 (owner: 10Majavah) [09:27:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:28:55] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:33:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:34:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:35:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39668 and previous config saved to /var/cache/conftool/dbconfig/20221115-093518-marostegui.json [09:35:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1129.eqiad.wmnet with reason: Maintenance [09:35:23] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [09:35:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1129.eqiad.wmnet with reason: Maintenance [09:35:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T321126)', diff 
saved to https://phabricator.wikimedia.org/P39669 and previous config saved to /var/cache/conftool/dbconfig/20221115-093539-marostegui.json [09:36:10] !log draining ganeti1022 for eventual reimage T311687 [09:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:14] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [09:36:37] (03CR) 10Elukey: "Left some notes after me Filippo and Joseph discussed the use case :)" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [09:37:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:38:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321126)', diff saved to https://phabricator.wikimedia.org/P39670 and previous config saved to /var/cache/conftool/dbconfig/20221115-093758-marostegui.json [09:38:28] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:39:10] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:44:10] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:44:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running 
calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:45:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1199.eqiad.wmnet with reason: Maintenance [09:45:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1199.eqiad.wmnet with reason: Maintenance [09:45:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T321130)', diff saved to https://phabricator.wikimedia.org/P39671 and previous config saved to /var/cache/conftool/dbconfig/20221115-094552-marostegui.json [09:45:57] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:47:23] (03PS24) 10Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) [09:47:52] (03CR) 10Elukey: centrallog: add first prototype of webrequest-live with Benthos (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [09:49:00] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38157/console" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [09:50:02] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [09:51:15] (03PS25) 10Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) [09:51:40] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [09:52:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:53:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P39672 and previous config saved to /var/cache/conftool/dbconfig/20221115-095306-marostegui.json [09:58:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321130)', diff saved to https://phabricator.wikimedia.org/P39673 and previous config saved to /var/cache/conftool/dbconfig/20221115-095843-marostegui.json [09:58:48] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:08:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P39674 and previous config saved to /var/cache/conftool/dbconfig/20221115-100812-marostegui.json [10:08:41] 10SRE, 10ops-codfw, 10DC-Ops, 
10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10jcrespo) @Papaul Yes, The RAID with the HDs should contain the OS, using the same custom recipe as db hosts (ideally that is sda, first hw raid virtual disk). The ssds ar... [10:09:22] (03PS2) 10Majavah: P:wmcs::metricsinfra: use `pip install -e .` [puppet] - 10https://gerrit.wikimedia.org/r/856916 [10:09:24] (03PS2) 10Majavah: P:wmcs::metricsinfra: add timer to sync project list [puppet] - 10https://gerrit.wikimedia.org/r/856917 [10:11:26] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:12] (03PS1) 10Majavah: add fake passwords for metricsinfra databases [labs/private] - 10https://gerrit.wikimedia.org/r/856926 [10:13:10] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] add fake passwords for metricsinfra databases [labs/private] - 10https://gerrit.wikimedia.org/r/856926 (owner: 10Majavah) [10:13:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P39675 and previous config saved to /var/cache/conftool/dbconfig/20221115-101349-marostegui.json [10:14:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:15:03] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38159/console" [puppet] - 10https://gerrit.wikimedia.org/r/856916 (owner: 10Majavah) [10:15:34] (03PS1) 10Jcrespo: database-backups: Update partman recipe for 
dbprov1004 / dbprov2004 [puppet] - 10https://gerrit.wikimedia.org/r/856927 (https://phabricator.wikimedia.org/T322256) [10:16:10] (03PS4) 10Slyngshede: Initial checkin [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) [10:18:00] (03PS3) 10Majavah: P:wmcs::metricsinfra: add timer to sync project list [puppet] - 10https://gerrit.wikimedia.org/r/856917 [10:18:45] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38161/console" [puppet] - 10https://gerrit.wikimedia.org/r/856917 (owner: 10Majavah) [10:19:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T321126)', diff saved to https://phabricator.wikimedia.org/P39676 and previous config saved to /var/cache/conftool/dbconfig/20221115-102319-marostegui.json [10:23:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:23:24] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [10:23:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:23:42] (03PS2) 10Jbond: utils: add additional selectors to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/856643 [10:23:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:24:02] (03CR) 10Jbond: [C: 03+2] apereo_cas: update beaker tests [puppet] - 
10https://gerrit.wikimedia.org/r/856668 (owner: 10Jbond) [10:24:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [10:24:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39677 and previous config saved to /var/cache/conftool/dbconfig/20221115-102409-marostegui.json [10:24:19] (03CR) 10CI reject: [V: 04-1] utils: add additional selectors to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/856643 (owner: 10Jbond) [10:26:23] (03PS3) 10Jbond: utils: add additional selectors to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/856643 [10:26:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39678 and previous config saved to /var/cache/conftool/dbconfig/20221115-102631-marostegui.json [10:27:01] (03CR) 10Marostegui: [C: 03+1] database-backups: Update partman recipe for dbprov1004 / dbprov2004 [puppet] - 10https://gerrit.wikimedia.org/r/856927 (https://phabricator.wikimedia.org/T322256) (owner: 10Jcrespo) [10:27:41] (03CR) 10Jcrespo: [C: 03+2] database-backups: Update partman recipe for dbprov1004 / dbprov2004 [puppet] - 10https://gerrit.wikimedia.org/r/856927 (https://phabricator.wikimedia.org/T322256) (owner: 10Jcrespo) [10:28:08] (03CR) 10Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/856544 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [10:28:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P39679 and previous config saved to /var/cache/conftool/dbconfig/20221115-102856-marostegui.json [10:36:26] (03PS1) 10Marostegui: Revert "db2162: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/856576 [10:36:34] (03CR) 10Jbond: [C: 03+2] utils: add additional selectors to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/856643 (owner: 10Jbond) [10:39:58] (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:41:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P39680 and previous config saved to /var/cache/conftool/dbconfig/20221115-104137-marostegui.json [10:41:43] (03PS2) 10Arturo Borrero Gonzalez: ceph: osd: factorize config read [puppet] - 10https://gerrit.wikimedia.org/r/856674 [10:41:45] (03PS5) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) [10:43:46] (03CR) 10Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [10:44:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321130)', diff saved to https://phabricator.wikimedia.org/P39681 and previous config saved to /var/cache/conftool/dbconfig/20221115-104402-marostegui.json [10:44:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:44:08] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:44:24] (03PS1) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [10:44:29] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1160.eqiad.wmnet with reason: Maintenance [10:44:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T321130)', diff saved to https://phabricator.wikimedia.org/P39682 and previous config saved to /var/cache/conftool/dbconfig/20221115-104435-marostegui.json [10:44:58] (KubernetesRsyslogDown) resolved: rsyslog on aux-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:45:05] (03CR) 10CI reject: [V: 04-1] ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [10:46:36] (03CR) 10Marostegui: [C: 03+2] Revert "db2162: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/856576 (owner: 10Marostegui) [10:47:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2162 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39683 and previous config saved to /var/cache/conftool/dbconfig/20221115-104752-root.json [10:49:02] (03PS1) 10Marostegui: Revert "db2166: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/856577 [10:49:05] 10SRE, 10Infrastructure-Foundations, 10CAS-SSO, 10Patch-For-Review: Update CAS to 6.6 - https://phabricator.wikimedia.org/T311235 (10MoritzMuehlenhoff) [10:51:45] (03CR) 10Marostegui: [C: 03+2] Revert "db2166: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/856577 (owner: 10Marostegui) [10:52:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2166 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39684 and previous config saved to /var/cache/conftool/dbconfig/20221115-105200-root.json [10:52:50] (03PS1) 10Muehlenhoff: Remove 
obsolete Puppet references related to decomissioned ELK5 clusters [puppet] - 10https://gerrit.wikimedia.org/r/856934 (https://phabricator.wikimedia.org/T281266) [10:53:05] (03CR) 10David Caro: cloudgw: introduce more robust vlan interface naming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/856544 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [10:55:11] (03CR) 10DCausse: [C: 03+1] cirrus: Increase small cluster heap memory from 8G to 10G [puppet] - 10https://gerrit.wikimedia.org/r/855673 (owner: 10Ebernhardson) [10:56:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P39685 and previous config saved to /var/cache/conftool/dbconfig/20221115-105644-marostegui.json [10:56:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T321130)', diff saved to https://phabricator.wikimedia.org/P39686 and previous config saved to /var/cache/conftool/dbconfig/20221115-105657-marostegui.json [10:57:01] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:58:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: introduce more robust vlan interface naming (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/856544 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [10:58:07] (03CR) 10David Caro: ceph: osd: introduce support for single NIC setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [10:59:02] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@3bb99c2]: (no justification provided) [10:59:07] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@3bb99c2]: (no justification provided) (duration: 00m 05s) [11:02:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2162 (re)pooling @ 3%: 
After upgrade', diff saved to https://phabricator.wikimedia.org/P39687 and previous config saved to /var/cache/conftool/dbconfig/20221115-110257-root.json [11:02:59] (03CR) 10David Caro: wmcs: add cookbook to add/remove a user to/from a project (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [11:05:11] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw2002-dev.codfw.wmnet with OS bullseye [11:05:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:07:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2166 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39688 and previous config saved to /var/cache/conftool/dbconfig/20221115-110705-root.json [11:10:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [11:10:50] (03Abandoned) 10Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/854998 (owner: 10Muehlenhoff) [11:10:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST nodes) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39689 and previous config saved to /var/cache/conftool/dbconfig/20221115-111150-marostegui.json [11:11:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1156.eqiad.wmnet with reason: Maintenance [11:11:55] T321126: Add column 
'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:12:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P39690 and previous config saved to /var/cache/conftool/dbconfig/20221115-111203-marostegui.json [11:12:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1156.eqiad.wmnet with reason: Maintenance [11:12:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:12:17] (03CR) 10Muehlenhoff: [C: 03+2] postgres: Remove support for Stretch [puppet] - 10https://gerrit.wikimedia.org/r/812230 (owner: 10Muehlenhoff) [11:12:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:12:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T321126)', diff saved to https://phabricator.wikimedia.org/P39691 and previous config saved to /var/cache/conftool/dbconfig/20221115-111229-marostegui.json [11:12:41] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 (owner: 10David Caro) [11:13:36] (03PS6) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [11:14:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321126)', diff saved to https://phabricator.wikimedia.org/P39692 and previous config saved to /var/cache/conftool/dbconfig/20221115-111449-marostegui.json [11:15:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] ceph.roll_restart_*daemons: allow ignoring current health 
issues (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [11:15:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ceph: osd: factorize config read (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856674 (owner: 10Arturo Borrero Gonzalez) [11:16:26] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service,systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet [11:18:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2162 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39693 and previous config saved to /var/cache/conftool/dbconfig/20221115-111802-root.json [11:18:08] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:19:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [11:20:07] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage [11:20:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q1:rack/setup/install contint1002 - https://phabricator.wikimedia.org/T313830 (10LSobanski) [11:20:39] !log failover ganeti master in drmrs/B12 to ganeti6001 [11:20:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2166 (re)pooling @ 5%: After upgrade', diff saved to 
https://phabricator.wikimedia.org/P39694 and previous config saved to /var/cache/conftool/dbconfig/20221115-112210-root.json [11:22:47] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: host reimage [11:22:47] (03CR) 10Vgutierrez: [C: 03+2] prometheus/trafficserver: Remove node_ats_config [puppet] - 10https://gerrit.wikimedia.org/r/856593 (https://phabricator.wikimedia.org/T292815) (owner: 10Vgutierrez) [11:24:50] (03CR) 10Hnowlan: [C: 03+2] swift: reenable more logging [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/856629 (owner: 10Hnowlan) [11:27:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P39695 and previous config saved to /var/cache/conftool/dbconfig/20221115-112709-marostegui.json [11:29:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P39696 and previous config saved to /var/cache/conftool/dbconfig/20221115-112956-marostegui.json [11:33:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2162 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39697 and previous config saved to /var/cache/conftool/dbconfig/20221115-113307-root.json [11:33:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [11:35:56] (03PS1) 10Marostegui: mariadb: Make codfw hosts ping when going down [puppet] - 10https://gerrit.wikimedia.org/r/856939 (https://phabricator.wikimedia.org/T322987) [11:36:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10jcrespo) Patch that should help: https://gerrit.wikimedia.org/r/856927 (asuming HDs RAID is sda) [11:36:54] (03Merged) 10jenkins-bot: swift: reenable more logging [software/thumbor-plugins] - 
10https://gerrit.wikimedia.org/r/856629 (owner: 10Hnowlan) [11:37:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2166 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39698 and previous config saved to /var/cache/conftool/dbconfig/20221115-113715-root.json [11:37:16] (03CR) 10Marostegui: "John, Jaime, Filippo, I would like to ask for a review of this, more context at https://phabricator.wikimedia.org/T322987" [puppet] - 10https://gerrit.wikimedia.org/r/856939 (https://phabricator.wikimedia.org/T322987) (owner: 10Marostegui) [11:42:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T321130)', diff saved to https://phabricator.wikimedia.org/P39699 and previous config saved to /var/cache/conftool/dbconfig/20221115-114216-marostegui.json [11:42:21] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:42:52] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2002-dev.codfw.wmnet with OS bullseye [11:45:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P39700 and previous config saved to /var/cache/conftool/dbconfig/20221115-114502-marostegui.json [11:45:18] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti6003.drmrs.wmnet [11:45:30] PROBLEM - ganeti-wconfd running on ganeti6003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:46:17] (03CR) 10Jcrespo: [C: 03+1] ""This isn't the most elegant approach" was said when it was initially rolled out. 
Ideally in the future this would be controlled by primar" [puppet] - 10https://gerrit.wikimedia.org/r/856939 (https://phabricator.wikimedia.org/T322987) (owner: 10Marostegui) [11:46:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance [11:46:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1102.eqiad.wmnet with reason: Maintenance [11:46:59] (03CR) 10Jcrespo: [C: 03+1] "Let me just double check it matches eqiad, except for misc." [puppet] - 10https://gerrit.wikimedia.org/r/856939 (https://phabricator.wikimedia.org/T322987) (owner: 10Marostegui) [11:47:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [11:48:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2162 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39701 and previous config saved to /var/cache/conftool/dbconfig/20221115-114812-root.json [11:48:58] (03PS1) 10Hnowlan: thumbor: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/856941 (https://phabricator.wikimedia.org/T233196) [11:50:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:50:50] (03CR) 10Marostegui: [C: 03+1] "I think this is fine. However maybe it is better to deploy with a DBA handy just in case heartbeats processes fail or something" [puppet] - 10https://gerrit.wikimedia.org/r/837120 (owner: 10Muehlenhoff) [11:51:28] (03CR) 10Jcrespo: [C: 03+1] "It does, so still +1. 
The thing that may be left is reviewing the Read only paging setup for misc hosts, which seem to be paging for non c" [puppet] - 10https://gerrit.wikimedia.org/r/856939 (https://phabricator.wikimedia.org/T322987) (owner: 10Marostegui) [11:51:37] (03CR) 10Elukey: [C: 03+2] centrallog: add first prototype of webrequest-live with Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [11:52:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2166 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39702 and previous config saved to /var/cache/conftool/dbconfig/20221115-115220-root.json [11:55:50] (03CR) 10Marostegui: mariadb: Make codfw hosts ping when going down (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856939 (https://phabricator.wikimedia.org/T322987) (owner: 10Marostegui) [11:55:52] (03CR) 10Marostegui: [C: 03+2] mariadb: Make codfw hosts ping when going down [puppet] - 10https://gerrit.wikimedia.org/r/856939 (https://phabricator.wikimedia.org/T322987) (owner: 10Marostegui) [11:58:14] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:00:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T321126)', diff saved to https://phabricator.wikimedia.org/P39703 and previous config saved to /var/cache/conftool/dbconfig/20221115-120009-marostegui.json [12:00:11] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:00:14] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:00:24] !log marostegui@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 5:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:00:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T321126)', diff saved to https://phabricator.wikimedia.org/P39704 and previous config saved to /var/cache/conftool/dbconfig/20221115-120030-marostegui.json [12:01:35] (03CR) 10Hnowlan: [C: 03+2] thumbor: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/856941 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:02:13] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: ensure vlan package is installed [puppet] - 10https://gerrit.wikimedia.org/r/856943 [12:02:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T321126)', diff saved to https://phabricator.wikimedia.org/P39705 and previous config saved to /var/cache/conftool/dbconfig/20221115-120249-marostegui.json [12:03:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2162 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39706 and previous config saved to /var/cache/conftool/dbconfig/20221115-120316-root.json [12:04:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1112.eqiad.wmnet with reason: Maintenance [12:04:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1112.eqiad.wmnet with reason: Maintenance [12:04:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:04:49] (03CR) 10CI reject: [V: 04-1] cloudgw: ensure vlan package is installed [puppet] - 10https://gerrit.wikimedia.org/r/856943 (owner: 10Arturo Borrero Gonzalez) [12:04:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:05:03] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'Depooling db1112 (T321130)', diff saved to https://phabricator.wikimedia.org/P39707 and previous config saved to /var/cache/conftool/dbconfig/20221115-120502-marostegui.json [12:05:07] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:06:00] (03Merged) 10jenkins-bot: thumbor: version bump [deployment-charts] - 10https://gerrit.wikimedia.org/r/856941 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [12:06:43] (03PS1) 10Arturo Borrero Gonzalez: spec: profile_wmcs_services_postgres_osm_primary_spec: don't try debian 9 [puppet] - 10https://gerrit.wikimedia.org/r/856944 [12:07:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2166 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39708 and previous config saved to /var/cache/conftool/dbconfig/20221115-120725-root.json [12:08:02] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: ensure vlan package is installed [puppet] - 10https://gerrit.wikimedia.org/r/856943 [12:11:01] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [12:11:16] !log resyncing maps2005 replica [12:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:51] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/856943 (owner: 10Arturo Borrero Gonzalez) [12:13:30] (03PS4) 10Giuseppe Lavagetto: Add rake task to perform basic conversions [deployment-charts] - 10https://gerrit.wikimedia.org/r/855668 [12:13:32] (03PS2) 10Giuseppe Lavagetto: Add rake task to convert deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/856517 [12:13:34] (03PS1) 10Giuseppe Lavagetto: apertium: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856945 [12:13:37] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:13:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/856944 
(owner: 10Arturo Borrero Gonzalez) [12:14:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] spec: profile_wmcs_services_postgres_osm_primary_spec: don't try debian 9 [puppet] - 10https://gerrit.wikimedia.org/r/856944 (owner: 10Arturo Borrero Gonzalez) [12:14:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: ensure vlan package is installed [puppet] - 10https://gerrit.wikimedia.org/r/856943 (owner: 10Arturo Borrero Gonzalez) [12:14:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T321130)', diff saved to https://phabricator.wikimedia.org/P39709 and previous config saved to /var/cache/conftool/dbconfig/20221115-121431-marostegui.json [12:14:34] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:14:35] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:16:12] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw2003-dev.codfw.wmnet with OS bullseye [12:16:29] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:17:12] (03CR) 10FNegri: [C: 03+1] "I tested this locally on my machine and it's working! I also verified I could query puppetdb by adding a simple query to a test cookbook." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [12:17:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P39710 and previous config saved to /var/cache/conftool/dbconfig/20221115-121755-marostegui.json [12:18:05] (03PS1) 10Ssingh: sites.yaml: remove decommissioned host lvs4005 [homer/public] - 10https://gerrit.wikimedia.org/r/856946 (https://phabricator.wikimedia.org/T317247) [12:18:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2162 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39711 and previous config saved to /var/cache/conftool/dbconfig/20221115-121821-root.json [12:18:26] XioNoX: ^ sorry! I will push this right away [12:18:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [12:22:26] (03PS1) 10Filippo Giunchedi: webrequest_live: add output label and limit threads [puppet] - 10https://gerrit.wikimedia.org/r/856948 (https://phabricator.wikimedia.org/T314981) [12:22:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2166 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39712 and previous config saved to /var/cache/conftool/dbconfig/20221115-122230-root.json [12:22:45] (03CR) 10Elukey: [C: 03+1] webrequest_live: add output label and limit threads [puppet] - 10https://gerrit.wikimedia.org/r/856948 (https://phabricator.wikimedia.org/T314981) (owner: 10Filippo Giunchedi) [12:25:33] (03CR) 10Filippo Giunchedi: [C: 03+2] webrequest_live: add output label and limit threads [puppet] - 10https://gerrit.wikimedia.org/r/856948 (https://phabricator.wikimedia.org/T314981) (owner: 10Filippo Giunchedi) [12:25:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [12:27:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] 
P:wmcs::metricsinfra: use `pip install -e .` [puppet] - 10https://gerrit.wikimedia.org/r/856916 (owner: 10Majavah) [12:27:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::metricsinfra: add timer to sync project list [puppet] - 10https://gerrit.wikimedia.org/r/856917 (owner: 10Majavah) [12:27:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:27:56] (03PS4) 10Arturo Borrero Gonzalez: P:wmcs::metricsinfra: add timer to sync project list [puppet] - 10https://gerrit.wikimedia.org/r/856917 (owner: 10Majavah) [12:29:22] (03PS2) 10Giuseppe Lavagetto: apertium: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856945 [12:29:24] (03PS1) 10Giuseppe Lavagetto: api-gateway: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/856950 [12:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P39713 and previous config saved to /var/cache/conftool/dbconfig/20221115-122937-marostegui.json [12:30:22] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:30:27] (03PS1) 10Filippo Giunchedi: benthos: differentiate input/output labels [puppet] - 10https://gerrit.wikimedia.org/r/856951 (https://phabricator.wikimedia.org/T314981) [12:30:53] (03CR) 10Elukey: [C: 03+1] benthos: differentiate input/output labels [puppet] - 10https://gerrit.wikimedia.org/r/856951 (https://phabricator.wikimedia.org/T314981) (owner: 10Filippo Giunchedi) [12:31:56] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [12:32:18] PROBLEM - ganeti-wconfd running on ganeti6004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd 
https://wikitech.wikimedia.org/wiki/Ganeti [12:32:36] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] benthos: differentiate input/output labels [puppet] - 10https://gerrit.wikimedia.org/r/856951 (https://phabricator.wikimedia.org/T314981) (owner: 10Filippo Giunchedi) [12:33:01] (03PS1) 10Majavah: P:wmcs::metricsinfra: fix user [puppet] - 10https://gerrit.wikimedia.org/r/856952 [12:33:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P39714 and previous config saved to /var/cache/conftool/dbconfig/20221115-123302-marostegui.json [12:33:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2162 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39715 and previous config saved to /var/cache/conftool/dbconfig/20221115-123326-root.json [12:35:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::metricsinfra: fix user [puppet] - 10https://gerrit.wikimedia.org/r/856952 (owner: 10Majavah) [12:36:59] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [12:37:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2166 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39716 and previous config saved to /var/cache/conftool/dbconfig/20221115-123735-root.json [12:37:42] (03PS1) 10Ssingh: hiera: update Traffic cloud instances hieradata for digicert-2022 [puppet] - 10https://gerrit.wikimedia.org/r/856955 [12:39:10] (03PS1) 10Jbond: puppet_compiler: update to server every thing from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/856956 [12:41:23] (03PS2) 10Jbond: puppet_compiler: update to server every thing from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/856956 [12:42:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38163/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/856956 (owner: 10Jbond) [12:43:42] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=ats-tls [12:43:42] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=ats-be [12:43:42] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4052.ulsfo.wmnet,service=varnish-fe [12:44:02] (03PS3) 10Jbond: puppet_compiler: update to server every thing from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/856956 [12:44:43] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38164/console" [puppet] - 10https://gerrit.wikimedia.org/r/856956 (owner: 10Jbond) [12:44:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P39717 and previous config saved to /var/cache/conftool/dbconfig/20221115-124443-marostegui.json [12:45:33] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on idp-test1002.wikimedia.org with reason: experiment with CAS 6.6 [12:45:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on idp-test1002.wikimedia.org with reason: experiment with CAS 6.6 [12:46:51] (03CR) 10CI reject: [V: 04-1] puppet_compiler: update to server every thing from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/856956 (owner: 10Jbond) [12:47:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [12:47:34] (03PS1) 10Arturo Borrero Gonzalez: cloudgw2003-dev: move to the new vlan NIC name [puppet] - 10https://gerrit.wikimedia.org/r/856959 [12:47:36] (03PS4) 10Jbond: puppet_compiler: update to server every thing from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/856956 [12:48:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1162 (T321126)', diff saved to https://phabricator.wikimedia.org/P39718 and previous config saved to /var/cache/conftool/dbconfig/20221115-124808-marostegui.json [12:48:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:48:13] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:48:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38166/console" [puppet] - 10https://gerrit.wikimedia.org/r/856956 (owner: 10Jbond) [12:48:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1170.eqiad.wmnet with reason: Maintenance [12:48:28] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:48:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39719 and previous config saved to /var/cache/conftool/dbconfig/20221115-124830-marostegui.json [12:49:41] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2003-dev.codfw.wmnet with OS bullseye [12:50:29] (03CR) 10CI reject: [V: 04-1] puppet_compiler: update to server every thing from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/856956 (owner: 10Jbond) [12:50:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39720 and previous config saved to /var/cache/conftool/dbconfig/20221115-125050-marostegui.json [12:51:49] (03PS2) 10Arturo Borrero Gonzalez: cloudgw2003-dev: move to the new vlan NIC name [puppet] - 10https://gerrit.wikimedia.org/r/856959 [12:51:51] 
(03PS1) 10Arturo Borrero Gonzalez: cloudgw2002-dev: cleanup leftovers [puppet] - 10https://gerrit.wikimedia.org/r/856962 [12:51:53] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: codfw1dev: move common hiera keys to the profile [puppet] - 10https://gerrit.wikimedia.org/r/856963 [12:52:16] (03PS5) 10Jbond: puppet_compiler: update to server every thing from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/856956 [12:52:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38167/console" [puppet] - 10https://gerrit.wikimedia.org/r/856956 (owner: 10Jbond) [12:53:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [12:55:09] (03CR) 10Ayounsi: [C: 03+1] sites.yaml: remove decommissioned host lvs4005 [homer/public] - 10https://gerrit.wikimedia.org/r/856946 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [12:56:19] (03PS1) 10Majavah: P:wmcs::metricsinfra: allow excluding projects from the sync [puppet] - 10https://gerrit.wikimedia.org/r/856967 [12:58:50] PROBLEM - Check systemd state on apt1001 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13335 [12:59:39] ^apt1001 is me, currently fixing [12:59:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T321130)', diff saved to https://phabricator.wikimedia.org/P39721 and previous config saved to /var/cache/conftool/dbconfig/20221115-125950-marostegui.json [12:59:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1145.eqiad.wmnet with reason: Maintenance [12:59:56] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:00:06] !log marostegui@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1145.eqiad.wmnet with reason: Maintenance [13:00:19] RECOVERY - Check systemd state on apt1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:59] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:wmcs::metricsinfra: allow excluding projects from the sync [puppet] - 10https://gerrit.wikimedia.org/r/856967 (owner: 10Majavah) [13:03:49] (03PS1) 10Arturo Borrero Gonzalez: cloudgw1001: prepare for reimage into the new vlan NIC name with a single NIC [puppet] - 10https://gerrit.wikimedia.org/r/856969 (https://phabricator.wikimedia.org/T319184) [13:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P39722 and previous config saved to /var/cache/conftool/dbconfig/20221115-130557-marostegui.json [13:08:54] !log failover ganeti master in ulsfo to ganeti4005 [13:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet_compiler: update to server every thing from puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/856956 (owner: 10Jbond) [13:12:49] (03PS1) 10Jbond: puppet_compiler: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/856971 [13:13:01] (03PS2) 10Jbond: puppet_compiler: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/856971 [13:13:05] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet_compiler: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/856971 (owner: 10Jbond) [13:16:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:17:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:17:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T321130)', diff saved to 
https://phabricator.wikimedia.org/P39723 and previous config saved to /var/cache/conftool/dbconfig/20221115-131710-marostegui.json [13:17:15] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:17:41] (03CR) 10FNegri: [C: 03+1] cloudgw2003-dev: move to the new vlan NIC name [puppet] - 10https://gerrit.wikimedia.org/r/856959 (owner: 10Arturo Borrero Gonzalez) [13:18:40] (03CR) 10FNegri: [C: 03+1] cloudgw2002-dev: cleanup leftovers [puppet] - 10https://gerrit.wikimedia.org/r/856962 (owner: 10Arturo Borrero Gonzalez) [13:19:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13335 [13:20:44] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [13:20:48] (03CR) 10Ssingh: [C: 03+2] sites.yaml: remove decommissioned host lvs4005 [homer/public] - 10https://gerrit.wikimedia.org/r/856946 (https://phabricator.wikimedia.org/T317247) (owner: 10Ssingh) [13:20:57] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [13:21:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P39724 and previous config saved to /var/cache/conftool/dbconfig/20221115-132103-marostegui.json [13:22:32] !log running homer for Gerrit: 856946 in cr*-ulsfo* [13:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:00] (03PS1) 10Jbond: puppet_compiler: use alias not root [puppet] - 10https://gerrit.wikimedia.org/r/856974 [13:23:21] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet_compiler: use alias not root [puppet] - 10https://gerrit.wikimedia.org/r/856974 (owner: 10Jbond) [13:24:09] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 87, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:24:17] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops: asw1-eqsin: VC 
mastership change - https://phabricator.wikimedia.org/T323094 (10ayounsi) Note that we will need to move the mastership back to `asw-0604-eqsin` to keep everything standardized. For that, better depool the site. [13:24:23] PROBLEM - ganeti-wconfd running on ganeti4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [13:25:23] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:26:26] 10SRE, 10ops-codfw, 10DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Marostegui) [13:26:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T321130)', diff saved to https://phabricator.wikimedia.org/P39725 and previous config saved to /var/cache/conftool/dbconfig/20221115-132637-marostegui.json [13:26:43] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:28:06] (03PS5) 10Ayounsi: Add Peering News to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/849114 [13:29:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [13:30:29] (03CR) 10Muehlenhoff: [C: 03+2] lists: Enable profile::auto_restarts::service for Apache [puppet] - 10https://gerrit.wikimedia.org/r/849561 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:30:50] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): Encoding issues when handling unicode characters in filenames - https://phabricator.wikimedia.org/T323114 (10hnowlan) [13:32:09] (03PS1) 10Jbond: Revert "puppet_compiler: update to server every thing from puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/856578 [13:32:33] (03CR) 10CI reject: [V: 04-1] Revert "puppet_compiler: update to server every thing from 
puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/856578 (owner: 10Jbond) [13:35:10] (03PS2) 10Jbond: Revert "puppet_compiler: update to server every thing from puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/856578 [13:36:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [13:36:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39726 and previous config saved to /var/cache/conftool/dbconfig/20221115-133610-marostegui.json [13:36:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1182.eqiad.wmnet with reason: Maintenance [13:36:18] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:36:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1182.eqiad.wmnet with reason: Maintenance [13:36:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T321126)', diff saved to https://phabricator.wikimedia.org/P39727 and previous config saved to /var/cache/conftool/dbconfig/20221115-133631-marostegui.json [13:38:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321126)', diff saved to https://phabricator.wikimedia.org/P39728 and previous config saved to /var/cache/conftool/dbconfig/20221115-133852-marostegui.json [13:39:15] (03PS3) 10Jbond: Revert "puppet_compiler: update to server every thing from puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/856578 [13:39:39] (03PS4) 10Jbond: Revert "puppet_compiler: update to server every thing from puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/856578 [13:40:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [13:40:26] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:40:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1099.eqiad.wmnet with reason: Maintenance [13:40:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39729 and previous config saved to /var/cache/conftool/dbconfig/20221115-134036-ladsgroup.json [13:40:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [13:40:41] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [13:40:46] (03PS5) 10Jbond: Revert "puppet_compiler: update to server every thing from puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/856578 [13:41:09] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "puppet_compiler: update to server every thing from puppetdb" [puppet] - 10https://gerrit.wikimedia.org/r/856578 (owner: 10Jbond) [13:41:35] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the context, LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/850628 (owner: 10Majavah) [13:41:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P39730 and previous config saved to /var/cache/conftool/dbconfig/20221115-134144-marostegui.json [13:43:56] (03CR) 10Filippo Giunchedi: [C: 04-1] "Change LGTM, though it'll trigger alerts re: auth" [puppet] - 10https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [13:44:20] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38171/console" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [13:44:36] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/856496 (https://phabricator.wikimedia.org/T323116) [13:44:39] (03PS1) 10Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - 10https://gerrit.wikimedia.org/r/856497 (https://phabricator.wikimedia.org/T323116) [13:44:50] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:45:11] (03CR) 10Filippo Giunchedi: [C: 03+2] cfssl: change intermediate key path [puppet] - 10https://gerrit.wikimedia.org/r/853281 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [13:45:51] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1136 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/856498 (https://phabricator.wikimedia.org/T323117) [13:45:55] (03PS1) 10Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/856499 (https://phabricator.wikimedia.org/T323117) [13:49:17] (03PS8) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 
10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [13:51:51] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops: Consider lowering IPv6 TCP MSS - https://phabricator.wikimedia.org/T283058 (10ayounsi) 05Open→03Declined We haven't seen any issue related to this so closing the task. [13:52:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3001.esams.wmnet [13:53:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P39731 and previous config saved to /var/cache/conftool/dbconfig/20221115-135358-marostegui.json [13:55:09] (03PS1) 10Jbond: puppet_compiler: move clean up jobs and proxy site to db instance [puppet] - 10https://gerrit.wikimedia.org/r/856983 [13:56:32] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cp5032.mgmt.eqsin.wmnet with reboot policy FORCED [13:56:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P39732 and previous config saved to /var/cache/conftool/dbconfig/20221115-135650-marostegui.json [13:57:45] (03CR) 10CI reject: [V: 04-1] puppet_compiler: move clean up jobs and proxy site to db instance [puppet] - 10https://gerrit.wikimedia.org/r/856983 (owner: 10Jbond) [13:58:33] (03PS9) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. 
[14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1400) [14:00:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3001.esams.wmnet [14:04:52] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw2003-dev: move to the new vlan NIC name [puppet] - 10https://gerrit.wikimedia.org/r/856959 (owner: 10Arturo Borrero Gonzalez) [14:07:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5032.mgmt.eqsin.wmnet with reboot policy FORCED [14:08:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw2002-dev: cleanup leftovers [puppet] - 10https://gerrit.wikimedia.org/r/856962 (owner: 10Arturo Borrero Gonzalez) [14:09:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P39733 and previous config saved to /var/cache/conftool/dbconfig/20221115-140905-marostegui.json [14:09:10] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: codfw1dev: move common hiera keys to the profile [puppet] - 10https://gerrit.wikimedia.org/r/856963 [14:09:35] !log aborrero@cumin1001 START - Cookbook sre.hosts.reimage for host cloudgw2003-dev.codfw.wmnet with OS bullseye [14:10:54] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T321130)', diff saved to https://phabricator.wikimedia.org/P39734 and previous config saved to /var/cache/conftool/dbconfig/20221115-141157-marostegui.json [14:11:59] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 5:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:12:02] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:12:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:12:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39735 and previous config saved to /var/cache/conftool/dbconfig/20221115-141218-marostegui.json [14:12:45] (03PS10) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:13:15] (03CR) 10Hokwelum: snapshot: Apply minor cleanups to cirrus dump script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856653 (owner: 10Ebernhardson) [14:14:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [14:15:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1136.eqiad.wmnet with reason: Maintenance [14:15:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T318955)', diff saved to https://phabricator.wikimedia.org/P39736 and previous config saved to /var/cache/conftool/dbconfig/20221115-141513-ladsgroup.json [14:15:18] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [14:17:18] (03PS7) 10Jbond: 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 [14:17:20] (03PS1) 10Jbond: nodegen: Fix issue when only one result is returned [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/856987 [14:17:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318955)', diff saved to 
https://phabricator.wikimedia.org/P39737 and previous config saved to /var/cache/conftool/dbconfig/20221115-141723-ladsgroup.json [14:19:09] (03PS2) 10Jbond: puppet_compiler: move clean up jobs and proxy site to db instance [puppet] - 10https://gerrit.wikimedia.org/r/856983 [14:19:30] (03PS4) 10Clare Ming: EditAttemptStep sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [14:19:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3002.esams.wmnet [14:19:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:21:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:21:11] (03PS11) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:21:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1202.eqiad.wmnet with reason: Maintenance [14:21:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T318950)', diff saved to https://phabricator.wikimedia.org/P39738 and previous config saved to /var/cache/conftool/dbconfig/20221115-142130-ladsgroup.json [14:21:38] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [14:21:46] (03CR) 10CI reject: [V: 04-1] puppet_compiler: move clean up jobs and proxy site to db instance [puppet] - 10https://gerrit.wikimedia.org/r/856983 (owner: 10Jbond) [14:23:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T318950)', diff saved to 
https://phabricator.wikimedia.org/P39739 and previous config saved to /var/cache/conftool/dbconfig/20221115-142342-ladsgroup.json [14:24:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T321126)', diff saved to https://phabricator.wikimedia.org/P39740 and previous config saved to /var/cache/conftool/dbconfig/20221115-142411-marostegui.json [14:24:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1188.eqiad.wmnet with reason: Maintenance [14:24:16] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:24:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1188.eqiad.wmnet with reason: Maintenance [14:24:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T321126)', diff saved to https://phabricator.wikimedia.org/P39741 and previous config saved to /var/cache/conftool/dbconfig/20221115-142432-marostegui.json [14:24:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:20] (03CR) 10Clare Ming: [C: 03+1] EditAttemptStep sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [14:25:22] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [14:26:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T321126)', diff saved to https://phabricator.wikimedia.org/P39742 and previous config saved to /var/cache/conftool/dbconfig/20221115-142652-marostegui.json [14:27:25] (03PS3) 10Jbond: puppet_compiler: move clean up jobs and 
proxy site to db instance [puppet] - 10https://gerrit.wikimedia.org/r/856983 [14:27:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3002.esams.wmnet [14:27:49] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2003-dev.codfw.wmnet with reason: host reimage [14:28:08] PROBLEM - puppet last run on sretest1001 is CRITICAL: CRITICAL: Puppet has been disabled for 604866 seconds, message: alex testing - akosiaris, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [14:29:03] jouncebot: nowandnext [14:29:03] For the next 0 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1400) [14:29:03] For the next 0 hour(s) and 30 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1400) [14:29:03] In 2 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1700) [14:29:17] * urbanecm steals the remainder of the B&C window [14:29:41] (03PS1) 10Urbanecm: updateIsActiveFlagForMentees: Process all mentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/856579 (https://phabricator.wikimedia.org/T318457) [14:30:48] (03CR) 10Urbanecm: [C: 03+2] updateIsActiveFlagForMentees: Process all mentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/856579 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:31:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5032'] [14:31:20] (03PS1) 10Urbanecm: MentorStore: Use $wgRCMaxAge instead of INACTIVITY_THRESHOLD [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/856988 (https://phabricator.wikimedia.org/T318457) [14:31:35] 
(03CR) 10Urbanecm: [C: 03+2] MentorStore: Use $wgRCMaxAge instead of INACTIVITY_THRESHOLD [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/856988 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:32:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P39743 and previous config saved to /var/cache/conftool/dbconfig/20221115-143229-ladsgroup.json [14:32:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/856579 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:32:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/856988 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:33:53] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [14:34:20] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@11-main.service,prometheus-pg-replication-lag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:35:08] while all the deployers are around… could someone check on the maintenance script for https://phabricator.wikimedia.org/T315510#8392683 for me? which wiki has it reached? [14:35:41] * urbanecm checks if the output is readable by me [14:36:14] MatmaRex: this is the last few lines https://www.irccloud.com/pastebin/qjhwMT24/ [14:36:34] thanks [14:38:02] MatmaRex: fwiw, there are some errors. Do you want a list now, or should it just wait for later, after the script finishes everywhere? 
[14:38:18] it's "Error while processing revid=X, pageid=Y" kind of errors [14:38:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P39744 and previous config saved to /var/cache/conftool/dbconfig/20221115-143848-ladsgroup.json [14:40:03] urbanecm: they are probably all boring parsing timeouts and known rare crashes [14:40:08] okay [14:40:35] urbanecm: feel free to put them on the task if you're bored, but i won't look into them now [14:40:50] !log failover ganeti master in esams to ganeti3001 [14:40:50] * urbanecm is never bored :D [14:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:42] jouncebot: nowandnext [14:41:42] For the next 0 hour(s) and 18 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1400) [14:41:42] For the next 0 hour(s) and 18 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1400) [14:41:42] In 2 hour(s) and 18 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1700) [14:41:51] let me know once you're done [14:41:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P39745 and previous config saved to /var/cache/conftool/dbconfig/20221115-144158-marostegui.json [14:42:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2003-dev.codfw.wmnet with OS bullseye [14:42:28] Amir1: i'm waiting on CI to merge my stuff. Zuul says 3 mins left [14:42:51] awesome [14:42:58] take your time [14:43:01] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[14:43:05] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5032'] [14:43:08] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5032'] [14:43:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [14:45:57] papaul: <3 thanks! [14:46:22] PROBLEM - ganeti-wconfd running on ganeti3003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:47:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P39746 and previous config saved to /var/cache/conftool/dbconfig/20221115-144736-ladsgroup.json [14:50:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [14:50:29] (03CR) 10Vgutierrez: "updated the VCL code to fix some issues and get an initial version that passes current tests, more patches incoming :)" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [14:50:34] (03Merged) 10jenkins-bot: updateIsActiveFlagForMentees: Process all mentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/856579 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:50:37] (03Merged) 10jenkins-bot: MentorStore: Use $wgRCMaxAge instead of INACTIVITY_THRESHOLD [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/856988 (https://phabricator.wikimedia.org/T318457) (owner: 10Urbanecm) [14:50:43] there we go [14:51:14] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5032'] [14:51:38] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:856579|updateIsActiveFlagForMentees: Process 
all mentees (T318457)]], [[gerrit:856988|MentorStore: Use $wgRCMaxAge instead of INACTIVITY_THRESHOLD (T318457)]] [14:51:42] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [14:52:09] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:856579|updateIsActiveFlagForMentees: Process all mentees (T318457)]], [[gerrit:856988|MentorStore: Use $wgRCMaxAge instead of INACTIVITY_THRESHOLD (T318457)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:52:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet [14:53:04] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [14:53:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10Papaul) [14:53:40] (03CR) 10Urbanecm: "lgtm, but needs the depends-on patches to reach all our wikis (=when wmf.11 starts to rollout)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856008 (owner: 10Sergio Gimeno) [14:53:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P39747 and previous config saved to /var/cache/conftool/dbconfig/20221115-145355-ladsgroup.json [14:55:30] !log installing tomcat security updates [14:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:34] jouncebot: nowandnext [14:56:34] For the next 0 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1400) [14:56:34] For the next 0 hour(s) and 3 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1400) [14:56:35] In 2 hour(s) and 3 minute(s): Puppet request window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1700) [14:57:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P39748 and previous config saved to /var/cache/conftool/dbconfig/20221115-145704-marostegui.json [14:57:22] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:856579|updateIsActiveFlagForMentees: Process all mentees (T318457)]], [[gerrit:856988|MentorStore: Use $wgRCMaxAge instead of INACTIVITY_THRESHOLD (T318457)]] (duration: 05m 44s) [14:57:26] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [14:58:00] Amir1: over to you! [14:58:10] awesome [14:58:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet [14:58:31] (03PS1) 10Elukey: turnilo: add webrequest_sampled_live datasource [puppet] - 10https://gerrit.wikimedia.org/r/856991 (https://phabricator.wikimedia.org/T314981) [14:58:57] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC NOOP https://puppet-compiler.wmflabs.org/pcc-worker1002/38179/" [puppet] - 10https://gerrit.wikimedia.org/r/856963 (owner: 10Arturo Borrero Gonzalez) [14:59:22] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=frwiki` at mwmaint1002 (T318457) [14:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:46] (03CR) 10Btullis: "Adding Steve, who is currently working on the turnilo config as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/856991 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [15:01:21] (03PS2) 10Arturo Borrero Gonzalez: cloudgw1001: prepare for reimage into the new vlan NIC name with a single NIC [puppet] - 10https://gerrit.wikimedia.org/r/856969 (https://phabricator.wikimedia.org/T319184) [15:01:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [15:01:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti3003.esams.wmnet [15:02:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [15:02:31] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 3003 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:02:32] (03CR) 10Vivian Rook: [C: 03+1] cloudgw: codfw1dev: move common hiera keys to the profile [puppet] - 10https://gerrit.wikimedia.org/r/856963 (owner: 10Arturo Borrero Gonzalez) [15:02:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T318955)', diff saved to https://phabricator.wikimedia.org/P39749 and previous config saved to /var/cache/conftool/dbconfig/20221115-150242-ladsgroup.json [15:02:47] T318955: Drop fr_comment and fr_text from flaggedrevs table in production - https://phabricator.wikimedia.org/T318955 [15:02:54] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] cloudgw: codfw1dev: move common hiera keys to the profile [puppet] - 10https://gerrit.wikimedia.org/r/856963 (owner: 10Arturo Borrero Gonzalez) [15:03:02] (03CR) 10Filippo Giunchedi: "Can't meaningfully vote but LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/856991 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [15:03:53] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:19] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:05:47] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:06:16] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [15:08:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti3003.esams.wmnet [15:09:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T318950)', diff saved to https://phabricator.wikimedia.org/P39750 and previous config saved to /var/cache/conftool/dbconfig/20221115-150901-ladsgroup.json [15:09:06] T318950: Fix renamed indexes of flaggedrevs table in production - https://phabricator.wikimedia.org/T318950 [15:09:57] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.postgresql.postgres-init (exit_code=99) [15:10:03] !log hnowlan@cumin1001 START - Cookbook sre.postgresql.postgres-init [15:12:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 
(T321126)', diff saved to https://phabricator.wikimedia.org/P39751 and previous config saved to /var/cache/conftool/dbconfig/20221115-151211-marostegui.json [15:12:13] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1197.eqiad.wmnet with reason: Maintenance [15:12:16] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:12:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1197.eqiad.wmnet with reason: Maintenance [15:12:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T321126)', diff saved to https://phabricator.wikimedia.org/P39752 and previous config saved to /var/cache/conftool/dbconfig/20221115-151232-marostegui.json [15:12:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39753 and previous config saved to /var/cache/conftool/dbconfig/20221115-151241-marostegui.json [15:12:45] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:14:11] !log installing expat security updates [15:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2004.codfw.wmnet with OS bullseye [15:14:35] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2004.codfw.wmnet with OS bullseye [15:14:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321126)', diff saved to https://phabricator.wikimedia.org/P39754 and previous config saved to /var/cache/conftool/dbconfig/20221115-151451-marostegui.json [15:15:28] !log Run `time 
mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=bnwiki` at mwmaint1002 (T318457) [15:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:33] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [15:16:55] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10Papaul) CP5032 firmware info . This server is ready for OS install. ` System BIOS Version = 1.7.5 Firmware Version = 6.00.30.00 ` Note it looks like PSU1 is not plugged `... [15:21:41] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:22:21] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:22:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39755 and previous config saved to /var/cache/conftool/dbconfig/20221115-152224-ladsgroup.json [15:22:31] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [15:23:38] (03PS1) 10Majavah: mobileapps: Bump container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/856994 [15:23:41] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:23:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:24:11] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:24:11] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-journal-flush.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:17] (03CR) 10Majavah: [C: 03+2] mobileapps: Bump container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/856994 (owner: 10Majavah) [15:25:39] Hey, what file do we need to work with now to change the site's logo? [15:26:06] I can't see the logo, wordmark and tagline sections in InitialiseSettings.php [15:27:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39756 and previous config saved to /var/cache/conftool/dbconfig/20221115-152747-marostegui.json [15:28:02] (03CR) 10Andrew Bogott: "I don't entirely understand the cookbooks where you've changed the imports and inheritance but didn't replace run() with run_with_proxy()." 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [15:28:45] (03Merged) 10jenkins-bot: mobileapps: Bump container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/856994 (owner: 10Majavah) [15:28:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:29:46] (03CR) 10BBlack: [C: 03+1] varnish: set expandtab in vim modeline [puppet] - 10https://gerrit.wikimedia.org/r/854104 (owner: 10BCornwall) [15:29:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P39757 and previous config saved to /var/cache/conftool/dbconfig/20221115-152957-marostegui.json [15:30:24] !log taavi@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:30:42] (03CR) 10FNegri: [C: 04-1] ceph.roll_restart_*daemons: allow ignoring current health issues (032 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [15:30:48] !log taavi@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:31:38] !log taavi@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:32:03] (03PS1) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 [15:32:31] !log taavi@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:32:36] !log taavi@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:33:29] !log taavi@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:34:05] (03CR) 10Stevemunene: [C: 
03+2] turnilo: add webrequest_sampled_live datasource [puppet] - 10https://gerrit.wikimedia.org/r/856991 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [15:34:45] RECOVERY - cassandra-a CQL 10.64.48.119:9042 on aqs1019 is OK: TCP OK - 0.000 second response time on 10.64.48.119 port 9042 https://phabricator.wikimedia.org/T93886 [15:35:26] thanks steve_munene ! [15:35:43] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST jobs) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:37:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P39758 and previous config saved to /var/cache/conftool/dbconfig/20221115-153731-ladsgroup.json [15:39:15] The text at the top of this file needs to be updated. Since the logo is now updated directly from here, https://noc.wikimedia.org/conf/highlight.php?file=logos.php [15:39:45] (03CR) 10FNegri: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [15:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST jobs) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:41:39] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:10] (03PS1) 10Majavah: admin: update my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/856998 [15:42:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P39759 and previous config saved to /var/cache/conftool/dbconfig/20221115-154253-marostegui.json [15:43:03] !log uploaded cas 6.6.2 to apt.wikimedia.org T311235 [15:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:08] T311235: Update CAS to 6.6 - https://phabricator.wikimedia.org/T311235 [15:44:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1022.eqiad.wmnet with reason: Remove from cluster for eventual reimage [15:44:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1022.eqiad.wmnet with reason: Remove from cluster for eventual reimage [15:44:35] (03CR) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [15:45:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P39760 and previous config saved to 
/var/cache/conftool/dbconfig/20221115-154504-marostegui.json [15:45:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1022.eqiad.wmnet with OS bullseye [15:45:46] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1022.eqiad.wmnet with OS bullseye [15:48:32] (03PS1) 10Ssingh: install_server: update late_command.sh to include eqsin (Linux 5.10) [puppet] - 10https://gerrit.wikimedia.org/r/857000 (https://phabricator.wikimedia.org/T322048) [15:49:48] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=ptwiki` at mwmaint1002 (T318457) [15:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:52] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [15:51:07] (03CR) 10CDanis: [C: 03+2] admin: update my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/856998 (owner: 10Majavah) [15:52:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P39761 and previous config saved to /var/cache/conftool/dbconfig/20221115-155237-ladsgroup.json [15:53:19] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:54:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbprov2004.codfw.wmnet with reason: host reimage [15:54:38] 10SRE, 10Epic, 10Sustainability (MediaWiki-MultiDC): Prepare and improve the datacenter switchover procedure - https://phabricator.wikimedia.org/T154658 (10LSobanski) [15:57:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
ganeti-test2001.codfw.wmnet [15:58:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T321130)', diff saved to https://phabricator.wikimedia.org/P39762 and previous config saved to /var/cache/conftool/dbconfig/20221115-155800-marostegui.json [15:58:01] (03CR) 10Herron: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/856934 (https://phabricator.wikimedia.org/T281266) (owner: 10Muehlenhoff) [15:58:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:58:04] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:58:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:58:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbprov2004.codfw.wmnet with reason: host reimage [15:58:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T321130)', diff saved to https://phabricator.wikimedia.org/P39763 and previous config saved to /var/cache/conftool/dbconfig/20221115-155821-marostegui.json [15:59:54] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1022.eqiad.wmnet with reason: host reimage [16:00:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T321126)', diff saved to https://phabricator.wikimedia.org/P39764 and previous config saved to /var/cache/conftool/dbconfig/20221115-160010-marostegui.json [16:00:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:00:15] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:00:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 
dbstore1007.eqiad.wmnet with reason: Maintenance [16:00:33] (03CR) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [16:00:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance [16:00:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance [16:01:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:01:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:01:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T321126)', diff saved to https://phabricator.wikimedia.org/P39765 and previous config saved to /var/cache/conftool/dbconfig/20221115-160140-marostegui.json [16:01:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet [16:03:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1022.eqiad.wmnet with reason: host reimage [16:03:54] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:03:58] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:06] PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [16:04:06] ^ expected due to ganeti1012 reboot [16:04:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2001.codfw.wmnet [16:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321126)', diff saved to https://phabricator.wikimedia.org/P39766 and previous config saved to /var/cache/conftool/dbconfig/20221115-160419-marostegui.json [16:04:25] !log 
jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [16:05:22] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [16:05:30] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [16:05:34] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms [16:05:36] RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [16:06:28] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all [16:07:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet [16:07:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T321130)', diff saved to https://phabricator.wikimedia.org/P39767 and previous config saved to /var/cache/conftool/dbconfig/20221115-160742-marostegui.json [16:07:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:07:47] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:07:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1105.eqiad.wmnet with reason: Maintenance [16:08:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39768 and previous config saved to /var/cache/conftool/dbconfig/20221115-160804-ladsgroup.json [16:08:10] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:08:25] (03CR) 10BCornwall: [C: 03+2] varnish: set expandtab in vim modeline [puppet] - 10https://gerrit.wikimedia.org/r/854104 (owner: 10BCornwall) [16:09:40] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:09:56] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:10:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [16:13:37] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10RobH) >>! In T323094#8395125, @Volans wrote: > I would also like to understand why eqsin was not immediately depooled when it happened. > The patch was already ready o... [16:13:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [16:14:40] !log initiating Cassandra bootstrap, aqs1019-b -- T307802 [16:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:44] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [16:15:28] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[16:15:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all [16:16:30] RECOVERY - cassandra-b SSL 10.64.48.122:7001 on aqs1019 is OK: SSL OK - Certificate aqs1019-b valid until 2024-11-08 15:06:32 +0000 (expires in 723 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [16:16:52] RECOVERY - cassandra-b service on aqs1019 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:17:05] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10Papaul) @Volans All looks good on the R650 the only issue is that the provision cookbook didn't setup the serial communication like what happen with the R450. Do you want... [16:17:46] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10RobH) Order of operations, all of this was at or just before 5AM UTC. * Jin racks new hosts, and plugs in cp5032 - the port was NOT setup and the server was not in ne... [16:18:40] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbprov2004.codfw.wmnet with OS bullseye [16:18:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2004.codfw.wmnet with OS bullseye completed: - dbprov2004 (**WARN**)... 
[16:19:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P39769 and previous config saved to /var/cache/conftool/dbconfig/20221115-161925-marostegui.json [16:20:14] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=arwiki` at mwmaint1002 (T318457) [16:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:18] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [16:20:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet [16:20:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [16:20:54] 10SRE, 10Znuny, 10serviceops-collab, 10User-Matthewrbowker: Proposal: Centralize OTRS login methodology - https://phabricator.wikimedia.org/T133476 (10LSobanski) 05Open→03Declined I don't think the proposed solution is a viable one. If you still think VRTS login needs work please open a new task. [16:20:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:21:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1022.eqiad.wmnet with OS bullseye [16:22:02] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1022.eqiad.wmnet with OS bullseye completed: - ganeti1022 (**PASS**) - Downtimed on... 
[16:22:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P39770 and previous config saved to /var/cache/conftool/dbconfig/20221115-162249-marostegui.json [16:22:50] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [16:23:12] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:28] PROBLEM - Host dse-k8s-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:32] PROBLEM - Host netflow1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:23:41] ^ expected due to ganeti1014 reboot [16:25:26] 10SRE, 10Pontoon, 10Patch-For-Review, 10User-fgiunchedi: Add PKI support to Pontoon - https://phabricator.wikimedia.org/T319163 (10fgiunchedi) [16:25:28] RECOVERY - Host netflow1002 is UP: PING WARNING - Packet loss = 80%, RTA = 0.77 ms [16:25:36] RECOVERY - Host dse-k8s-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms [16:25:38] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [16:25:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:27:39] (03CR) 10Jbond: [C: 03+2] puppet_compiler: move clean up jobs and proxy site to db instance [puppet] - 10https://gerrit.wikimedia.org/r/856983 (owner: 10Jbond) [16:27:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet [16:28:11] (03PS3) 10JMeybohm: k8s: Add a central ipv6dualstack flag to enable dual stack [puppet] - 10https://gerrit.wikimedia.org/r/856589 (https://phabricator.wikimedia.org/T307943) [16:28:13] (03PS1) 10JMeybohm: k8s: Fix duplicate definition of --service-account-key-file [puppet] - 
10https://gerrit.wikimedia.org/r/857004 (https://phabricator.wikimedia.org/T307943) [16:29:58] (03PS2) 10Filippo Giunchedi: pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) [16:30:00] (03PS1) 10Filippo Giunchedi: pontoon: copy out the root pki ca [puppet] - 10https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) [16:30:02] (03PS1) 10Filippo Giunchedi: pontoon: install Puppet and PKI CAs as certificates [puppet] - 10https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) [16:31:05] (03CR) 10CI reject: [V: 04-1] pontoon: copy out the root pki ca [puppet] - 10https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [16:31:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/857000 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [16:31:57] (03PS1) 10JHathaway: aux-k8s: remove CoreDNS affinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/857009 (https://phabricator.wikimedia.org/T321120) [16:32:44] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38183/console" [puppet] - 10https://gerrit.wikimedia.org/r/857004 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:33:36] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10jcrespo) @Papaul You probably are not asking me, but the work on the server is scheduled for next quarter, so feel free to do more tests/work with this server. We actuall... 
[16:34:05] !log ladsgroup: Deployed security patch for T320987 [16:34:07] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuraiton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [16:34:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P39772 and previous config saved to /var/cache/conftool/dbconfig/20221115-163432-marostegui.json [16:34:47] (03CR) 10JMeybohm: k8s: Fix duplicate definition of --service-account-key-file [puppet] - 10https://gerrit.wikimedia.org/r/857004 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:34:56] (03CR) 10JMeybohm: [V: 03+1] k8s: Fix duplicate definition of --service-account-key-file [puppet] - 10https://gerrit.wikimedia.org/r/857004 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:35:30] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, nice!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 10Muehlenhoff) [16:35:58] (KubernetesCalicoDown) firing: aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:36:31] (03CR) 10Elukey: [C: 03+1] k8s: Fix duplicate definition of --service-account-key-file [puppet] - 10https://gerrit.wikimedia.org/r/857004 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:36:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [16:37:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:37:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2103.codfw.wmnet with reason: Maintenance [16:37:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T318605)', diff saved to https://phabricator.wikimedia.org/P39773 and previous config saved to /var/cache/conftool/dbconfig/20221115-163721-ladsgroup.json [16:37:25] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [16:37:36] (03CR) 10JHathaway: [C: 03+2] aux-k8s: remove CoreDNS affinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/857009 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [16:37:47] !log ladsgroup: Deployed security patch for T320987 [16:37:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P39774 and previous config saved to /var/cache/conftool/dbconfig/20221115-163755-marostegui.json [16:38:13] (03CR) 10JHathaway: [V: 03+2 C: 03+2] aux-k8s: remove CoreDNS 
affinity rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/857009 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [16:38:27] (03CR) 10Elukey: [C: 03+1] k8s: Add a central ipv6dualstack flag to enable dual stack [puppet] - 10https://gerrit.wikimedia.org/r/856589 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [16:39:58] (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:40:20] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [16:40:20] (03CR) 10Vgutierrez: Varnish analytics: support differential privacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [16:41:20] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops, 10Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10jcrespo) [16:41:26] !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry rolling restart_daemons on A:docker-registry [16:43:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-docker-registry (exit_code=0) rolling restart_daemons on A:docker-registry [16:44:58] (KubernetesRsyslogDown) resolved: rsyslog on aux-k8s-worker1002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:46:03] PROBLEM - Host aux-k8s-worker1001 is DOWN: PING CRITICAL - Packet loss = 100% [16:46:53] RECOVERY - Host aux-k8s-worker1001 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [16:49:04] (03CR) 
10Ssingh: [C: 03+2] install_server: update late_command.sh to include eqsin (Linux 5.10) [puppet] - 10https://gerrit.wikimedia.org/r/857000 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [16:49:06] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [16:49:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T321126)', diff saved to https://phabricator.wikimedia.org/P39775 and previous config saved to /var/cache/conftool/dbconfig/20221115-164939-marostegui.json [16:49:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:49:43] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:49:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:50:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T321126)', diff saved to https://phabricator.wikimedia.org/P39776 and previous config saved to /var/cache/conftool/dbconfig/20221115-165001-marostegui.json [16:50:50] (03CR) 10BBlack: ""Pooled" can mean different things to different consumers of the metric. 
AIUI (but I could be wrong!): "server.enabled" is whether the ho" [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/845001 (https://phabricator.wikimedia.org/T321191) (owner: 10Filippo Giunchedi) [16:51:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:51:58] (KubernetesCalicoDown) firing: aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:52:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321126)', diff saved to https://phabricator.wikimedia.org/P39777 and previous config saved to /var/cache/conftool/dbconfig/20221115-165244-marostegui.json [16:53:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T321130)', diff saved to https://phabricator.wikimedia.org/P39778 and previous config saved to /var/cache/conftool/dbconfig/20221115-165302-marostegui.json [16:53:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1179.eqiad.wmnet with reason: Maintenance [16:53:07] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:53:17] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1179.eqiad.wmnet with reason: Maintenance [16:53:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T321130)', diff saved to https://phabricator.wikimedia.org/P39779 and previous config saved to /var/cache/conftool/dbconfig/20221115-165323-marostegui.json 
[16:54:02] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) [16:55:58] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:56:28] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:56:45] (JobUnavailable) resolved: (2) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:56:58] (KubernetesCalicoDown) resolved: aux-k8s-worker1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-worker1001.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:57:23] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:58:35] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:58:58] (KubernetesAPILatency) firing: (7) High Kubernetes API latency (GET endpoints) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:00:05] jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:00:26] (03CR) 10Dzahn: [C: 04-1] "the code says nginx but the intention is apache, right?" [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:01:23] (03CR) 10David Caro: ceph: osd: introduce support for single NIC setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [17:02:07] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=fawiki` at mwmaint1002 (T318457) [17:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:12] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [17:02:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T321130)', diff saved to https://phabricator.wikimedia.org/P39780 and previous config saved to /var/cache/conftool/dbconfig/20221115-170245-marostegui.json [17:02:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:03:58] (KubernetesAPILatency) resolved: (15) High Kubernetes API latency (GET clusterinformations) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:04:28] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node 
Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:04:45] (JobUnavailable) firing: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:05:22] (03PS1) 10Muehlenhoff: Extend cloudbackup Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/857014 [17:05:52] (03PS13) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [17:06:28] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:06:42] (03CR) 10Ahmon Dancy: Enable profile::auto_restarts::service for Apache on deployment servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:07:18] (03PS5) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 [17:07:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P39781 and previous config saved to /var/cache/conftool/dbconfig/20221115-170751-marostegui.json [17:08:30] (03PS6) 10Dzahn: dumps/distribution: fix values that don't fit into data types [puppet] - 10https://gerrit.wikimedia.org/r/855096 [17:08:43] (03PS1) 10Muehlenhoff: Add Cumin alias for dispatch [puppet] - 10https://gerrit.wikimedia.org/r/857015 [17:08:58] (03CR) 10Dzahn: "Hi Hannah, so I amended again and I am now using none.example.com. Does the current version look acceptable? 
Cheers, Dan" [puppet] - 10https://gerrit.wikimedia.org/r/855096 (owner: 10Dzahn) [17:09:25] PROBLEM - SSH on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:09:45] (JobUnavailable) firing: (2) Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:10:09] (03PS1) 10Muehlenhoff: Add Cumin alias for orchestrator [puppet] - 10https://gerrit.wikimedia.org/r/857017 [17:10:10] !log add 150G to prometheus/ops in eqiad [17:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:39] (03PS1) 10Jbond: puppet_compiler: drop hostname from http_url [puppet] - 10https://gerrit.wikimedia.org/r/857018 [17:10:55] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:19] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:12:53] RECOVERY - Check systemd state on aux-k8s-ctrl1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:00] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) [17:13:24] (03CR) 10Muehlenhoff: Enable profile::auto_restarts::service for Apache on deployment servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:13:39] PROBLEM - Check the NTP synchronisation status of timesyncd on aux-k8s-ctrl1001 is CRITICAL: 
CHECK_NRPE: Error - Could not connect to 10.64.0.42: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [17:14:15] (03PS2) 10Ebernhardson: snapshot: Apply minor cleanups to cirrus dump script [puppet] - 10https://gerrit.wikimedia.org/r/856653 [17:14:17] (03CR) 10Ebernhardson: snapshot: Apply minor cleanups to cirrus dump script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856653 (owner: 10Ebernhardson) [17:14:19] (03PS6) 10Ebernhardson: snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) [17:14:28] (KubernetesCalicoDown) resolved: aux-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Faux-k8s&var-instance=aux-k8s-ctrl1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:14:45] (03CR) 10Jbond: [C: 03+2] puppet_compiler: drop hostname from http_url [puppet] - 10https://gerrit.wikimedia.org/r/857018 (owner: 10Jbond) [17:14:45] (JobUnavailable) resolved: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:16:10] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=enwiki` at mwmaint1002 (T318457) [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:15] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [17:16:26] (03CR) 10Ahmon Dancy: [C: 03+1] Enable profile::auto_restarts::service for Apache on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) 
[17:16:57] (03PS1) 10Jbond: puppet_compiler: remove trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/857019 [17:17:15] RECOVERY - SSH on aux-k8s-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:17:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P39782 and previous config saved to /var/cache/conftool/dbconfig/20221115-171752-marostegui.json [17:18:51] PROBLEM - Host aux-k8s-ctrl1002 is DOWN: PING CRITICAL - Packet loss = 100% [17:19:38] (03CR) 10Jbond: [C: 03+2] puppet_compiler: remove trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/857019 (owner: 10Jbond) [17:20:34] (03PS14) 10Vgutierrez: Varnish analytics: support differential privacy [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [17:20:35] 10SRE, 10Infrastructure-Foundations, 10netops: Migrate row E/F network aggregation to dedicated Spine switches - https://phabricator.wikimedia.org/T322937 (10cmooney) [17:21:28] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST cronjobs) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:22:17] RECOVERY - Host aux-k8s-ctrl1002 is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms [17:22:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P39783 and previous config saved to /var/cache/conftool/dbconfig/20221115-172257-marostegui.json [17:24:13] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH events) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:24:49] (03CR) 10Dzahn: [C: 03+2] Enable profile::auto_restarts::service for Apache on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:25:22] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [17:25:23] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [17:26:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:26:31] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38190/console" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [17:26:34] (03CR) 10Filippo Giunchedi: Add 'pybal_server_pooled' metric (031 comment) [debs/pybal] (1.15) - 10https://gerrit.wikimedia.org/r/845001 (https://phabricator.wikimedia.org/T321191) (owner: 10Filippo Giunchedi) [17:27:10] (03CR) 10Dzahn: "re: license all looks good, thank you. 
but please see the comment from Volans in another matter" [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [17:27:18] (03CR) 10Filippo Giunchedi: [C: 03+1] Add Cumin alias for dispatch [puppet] - 10https://gerrit.wikimedia.org/r/857015 (owner: 10Muehlenhoff) [17:28:03] (03CR) 10Dzahn: [C: 03+2] "systemd timers have been created on deploy1002 now" [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:28:48] !log deploy1002:~] $ sudo systemctl start wmf_auto_restart_apache2.service [17:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:01] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [17:29:18] (03PS1) 10Jbond: puppet_compiler: output directory is actully output [puppet] - 10https://gerrit.wikimedia.org/r/857022 [17:29:20] (03CR) 10Dzahn: [C: 03+2] "tested by manually starting the service.[ deploy1002:~] $ sudo systemctl start wmf_auto_restart_apache2.service - Main PID: 31376 (code=e" [puppet] - 10https://gerrit.wikimedia.org/r/857012 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:29:22] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [17:29:23] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[17:29:32] (03PS2) 10Filippo Giunchedi: pontoon: copy out the root pki ca [puppet] - 10https://gerrit.wikimedia.org/r/857006 (https://phabricator.wikimedia.org/T319163) [17:29:34] (03PS2) 10Filippo Giunchedi: pontoon: install Puppet and PKI CAs as certificates [puppet] - 10https://gerrit.wikimedia.org/r/857007 (https://phabricator.wikimedia.org/T319163) [17:30:09] (03CR) 10Dzahn: [C: 03+2] "[cumin2002:~] $ sudo cumin 'P{O:dispatch::backend}' 'uname'" [puppet] - 10https://gerrit.wikimedia.org/r/857015 (owner: 10Muehlenhoff) [17:30:11] (03CR) 10Jbond: [C: 03+2] puppet_compiler: output directory is actully output [puppet] - 10https://gerrit.wikimedia.org/r/857022 (owner: 10Jbond) [17:31:01] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:32:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P39784 and previous config saved to /var/cache/conftool/dbconfig/20221115-173258-marostegui.json [17:33:29] (03PS1) 10Majavah: P:pontoon: include firewall rules to allow metricsinfra scraping [puppet] - 10https://gerrit.wikimedia.org/r/857023 [17:34:11] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10Papaul) @jcrespo thank you for the update [17:37:00] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [17:37:57] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@11-main.service,prometheus-pg-replication-lag.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T321126)', diff saved to https://phabricator.wikimedia.org/P39785 and previous config saved to /var/cache/conftool/dbconfig/20221115-173804-marostegui.json [17:38:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2126.codfw.wmnet with reason: Maintenance [17:38:09] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:38:20] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2126.codfw.wmnet with reason: Maintenance [17:38:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [17:38:26] (03PS1) 10Ssingh: cp5032: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/857024 (https://phabricator.wikimedia.org/T322048) [17:38:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [17:38:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T321126)', diff saved to https://phabricator.wikimedia.org/P39786 and previous config saved to /var/cache/conftool/dbconfig/20221115-173841-marostegui.json [17:39:31] (03CR) 10Dzahn: gerrit: script to report on git gc durations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [17:40:33] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 1760 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [17:40:35] (03CR) 10Nskaggs: wmcs: add cookbook to add/remove a user to/from a project (031 comment) [cookbooks] (wmcs) - 
10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [17:40:37] (03CR) 10Jbond: [C: 03+2] directories: add change id to the output dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 (owner: 10Jbond) [17:40:46] (03CR) 10Jbond: [C: 03+2] puppet_compiler.differ: add support to filter by core type (032 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (https://phabricator.wikimedia.org/T245828) (owner: 10Jbond) [17:40:53] (03CR) 10Jbond: [C: 03+2] worker: store catalogs as gziped file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 (https://phabricator.wikimedia.org/T222075) (owner: 10Jbond) [17:40:58] (03CR) 10Jbond: [C: 03+2] controller: fix get_states to avoid list reordering [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) (owner: 10Jbond) [17:41:03] (03CR) 10Jbond: [C: 03+2] differ: add support for concat_fragment [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852984 (https://phabricator.wikimedia.org/T286255) (owner: 10Jbond) [17:41:10] (03CR) 10Jbond: [C: 03+2] prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [17:41:16] (03CR) 10Jbond: [C: 03+2] controller: Add option for basic pcc run [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/853382 (https://phabricator.wikimedia.org/T289666) (owner: 10Jbond) [17:41:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321126)', diff saved to https://phabricator.wikimedia.org/P39787 and previous config saved to /var/cache/conftool/dbconfig/20221115-174120-marostegui.json [17:41:27] (03CR) 10Jbond: [C: 03+2] nodegen: Fix issue when only one result is returned [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/856987 (owner: 10Jbond) [17:41:34] (03CR) 10Jbond: [C: 03+2] 2.5.0: 
prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 (owner: 10Jbond) [17:41:55] (03CR) 10Ssingh: [C: 03+2] cp5032: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/857024 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [17:41:59] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:42:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5032.eqsin.wmnet with OS buster [17:43:07] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS buster [17:43:23] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/857023 (owner: 10Majavah) [17:43:29] (03Merged) 10jenkins-bot: directories: add change id to the output dir [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/851118 (owner: 10Jbond) [17:44:09] (03Merged) 10jenkins-bot: puppet_compiler.differ: add support to filter by core type [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/746947 (https://phabricator.wikimedia.org/T245828) (owner: 10Jbond) [17:44:11] (03Merged) 10jenkins-bot: worker: store catalogs as gziped file [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852280 (https://phabricator.wikimedia.org/T222075) (owner: 10Jbond) [17:44:13] (03Merged) 10jenkins-bot: controller: fix get_states to avoid list reordering [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852284 (https://phabricator.wikimedia.org/T224977) (owner: 10Jbond) [17:44:15] (03Merged) 10jenkins-bot: differ: add support for concat_fragment [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852984 
(https://phabricator.wikimedia.org/T286255) (owner: 10Jbond) [17:44:33] RECOVERY - Check the NTP synchronisation status of timesyncd on aux-k8s-ctrl1001 is OK: OK: synced at Tue 2022-11-15 17:44:32 UTC. https://wikitech.wikimedia.org/wiki/NTP [17:45:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39788 and previous config saved to /var/cache/conftool/dbconfig/20221115-174506-ladsgroup.json [17:45:11] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [17:45:59] (03Merged) 10jenkins-bot: prepare: Allow specify a private repo change [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852993 (https://phabricator.wikimedia.org/T265633) (owner: 10Jbond) [17:46:09] (03Merged) 10jenkins-bot: controller: Add option for basic pcc run [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/853382 (https://phabricator.wikimedia.org/T289666) (owner: 10Jbond) [17:46:25] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[17:46:28] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/857046 [17:46:30] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/857047 [17:46:37] (03Merged) 10jenkins-bot: nodegen: Fix issue when only one result is returned [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/856987 (owner: 10Jbond) [17:46:44] (03Merged) 10jenkins-bot: 2.5.0: prepare release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/852837 (owner: 10Jbond) [17:48:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T321130)', diff saved to https://phabricator.wikimedia.org/P39789 and previous config saved to /var/cache/conftool/dbconfig/20221115-174805-marostegui.json [17:48:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1189.eqiad.wmnet with reason: Maintenance [17:48:10] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:48:21] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1189.eqiad.wmnet with reason: Maintenance [17:48:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T321130)', diff saved to https://phabricator.wikimedia.org/P39790 and previous config saved to /var/cache/conftool/dbconfig/20221115-174827-marostegui.json [17:49:48] (03PS1) 10Jbond: puppet_compiler: set group and bum version [puppet] - 10https://gerrit.wikimedia.org/r/857027 [17:50:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38191/console" [puppet] - 10https://gerrit.wikimedia.org/r/857027 (owner: 10Jbond) [17:54:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppet_compiler: set group and bum version [puppet] - 10https://gerrit.wikimedia.org/r/857027 (owner: 10Jbond) [17:54:21] !log move pcc to 2.5.0 
[17:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P39791 and previous config saved to /var/cache/conftool/dbconfig/20221115-175627-marostegui.json [18:00:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:00:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P39792 and previous config saved to /var/cache/conftool/dbconfig/20221115-180012-ladsgroup.json [18:01:34] (03PS2) 10Herron: dispatch: add apache redirect from default org to wikimedia org [puppet] - 10https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) [18:01:41] PROBLEM - Check systemd state on ms-be1070 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:03:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38192/console" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:04:47] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1070 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:05:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:06:03] !log Run `time mwscript extensions/GrowthExperiments/maintenance/updateIsActiveFlagForMentees.php --wiki=idwiki` at mwmaint1002 (T318457) [18:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:08] T318457: Enable "Your unstarred mentees" at the biggest Growth wikis - https://phabricator.wikimedia.org/T318457 [18:08:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10Volans) >>! In T321128#8396595, @Papaul wrote: > @Volans All looks good on the R650 the only issue is that the provision cookbook didn't setup the serial communication li... [18:10:03] (03PS1) 10Jbond: puppet_compiler: bump version to 2.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/857029 [18:10:19] (03PS2) 10Jbond: puppet_compiler: bump version to 2.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/857029 [18:10:27] (03CR) 10Jbond: [V: 03+2 C: 03+2] puppet_compiler: bump version to 2.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/857029 (owner: 10Jbond) [18:10:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T318605)', diff saved to https://phabricator.wikimedia.org/P39793 and previous config saved to /var/cache/conftool/dbconfig/20221115-181037-ladsgroup.json [18:10:44] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:11:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P39794 and previous config saved to /var/cache/conftool/dbconfig/20221115-181133-marostegui.json [18:11:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST virtualservices) on 
k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:13:18] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage [18:14:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38193/console" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:15:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P39795 and previous config saved to /var/cache/conftool/dbconfig/20221115-181519-ladsgroup.json [18:16:43] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5032.eqsin.wmnet with reason: host reimage [18:16:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:19:55] (03PS1) 10Cwhite: Add bullseye support. 
[debs/prometheus-logstash-exporter] - 10https://gerrit.wikimedia.org/r/857049 (https://phabricator.wikimedia.org/T321410) [18:20:16] (03PS1) 10Stang: logos: Remove duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857030 [18:20:43] (03PS1) 10Jbond: DO NOt MEREGE: change to demon new reporting in pcc [puppet] - 10https://gerrit.wikimedia.org/r/857031 [18:21:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38194/console" [puppet] - 10https://gerrit.wikimedia.org/r/857031 (owner: 10Jbond) [18:21:57] (03PS9) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [18:22:01] (03CR) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks (033 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [18:22:05] (03PS5) 10David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 [18:22:13] (03PS2) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [18:22:17] (03CR) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [18:25:05] (03CR) 10CI reject: [V: 04-1] wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [18:25:13] (03CR) 10CI reject: [V: 04-1] ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [18:25:15] (03CR) 10CI reject: [V: 04-1] wmcs.ceph.set_cluster_in_maintenance: fix 
bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 (owner: 10David Caro) [18:25:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P39796 and previous config saved to /var/cache/conftool/dbconfig/20221115-182545-ladsgroup.json [18:26:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T321126)', diff saved to https://phabricator.wikimedia.org/P39797 and previous config saved to /var/cache/conftool/dbconfig/20221115-182640-marostegui.json [18:26:41] RECOVERY - Check systemd state on ms-be1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2138.codfw.wmnet with reason: Maintenance [18:26:46] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:27:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2138.codfw.wmnet with reason: Maintenance [18:27:10] (03CR) 10Cwhite: [C: 03+2] beta-logs: transition jobs host assignment to bullseye host [puppet] - 10https://gerrit.wikimedia.org/r/854111 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [18:27:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39798 and previous config saved to /var/cache/conftool/dbconfig/20221115-182712-marostegui.json [18:27:18] (03PS2) 10Cwhite: beta-logs: transition jobs host assignment to bullseye host [puppet] - 10https://gerrit.wikimedia.org/r/854111 (https://phabricator.wikimedia.org/T321410) [18:28:54] (03PS2) 10Jbond: DO NOt MEREGE: change to demon new reporting in pcc [puppet] - 10https://gerrit.wikimedia.org/r/857031 [18:29:30] (03CR) 10CI reject: [V: 04-1] DO NOt 
MEREGE: change to demon new reporting in pcc [puppet] - 10https://gerrit.wikimedia.org/r/857031 (owner: 10Jbond) [18:29:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38195/console" [puppet] - 10https://gerrit.wikimedia.org/r/857031 (owner: 10Jbond) [18:29:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39799 and previous config saved to /var/cache/conftool/dbconfig/20221115-182955-marostegui.json [18:30:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T318605)', diff saved to https://phabricator.wikimedia.org/P39800 and previous config saved to /var/cache/conftool/dbconfig/20221115-183025-ladsgroup.json [18:30:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [18:30:30] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:30:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [18:30:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:30:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [18:30:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T318605)', diff saved to https://phabricator.wikimedia.org/P39801 and previous config saved to /var/cache/conftool/dbconfig/20221115-183053-ladsgroup.json [18:33:23] (03PS3) 10Jbond: DO NOt MEREGE: change to demon new reporting in pcc [puppet] - 10https://gerrit.wikimedia.org/r/857031 [18:33:43] (03CR) 
10Ahmon Dancy: [C: 03+1] scap: update logstash_host for beta scap [puppet] - 10https://gerrit.wikimedia.org/r/854109 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [18:33:58] (03CR) 10CI reject: [V: 04-1] DO NOt MEREGE: change to demon new reporting in pcc [puppet] - 10https://gerrit.wikimedia.org/r/857031 (owner: 10Jbond) [18:34:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38196/console" [puppet] - 10https://gerrit.wikimedia.org/r/857031 (owner: 10Jbond) [18:35:15] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1070 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:36:55] (03PS1) 10Jbond: differ: exclude classes from core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/857032 [18:38:16] (03PS2) 10Stang: logos: Remove duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857030 (https://phabricator.wikimedia.org/T307705) [18:38:55] (03CR) 10CI reject: [V: 04-1] differ: exclude classes from core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/857032 (owner: 10Jbond) [18:40:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P39802 and previous config saved to /var/cache/conftool/dbconfig/20221115-184051-ladsgroup.json [18:41:09] (03PS2) 10Jbond: differ: exclude classes from core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/857032 [18:42:36] (03PS1) 10Jbond: puppet_compiler: bump to 2.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/857033 [18:43:09] (03CR) 10CI reject: [V: 04-1] differ: exclude classes from core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/857032 (owner: 10Jbond) [18:43:57] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: 
cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:44:11] (03PS3) 10Jbond: differ: exclude classes from core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/857032 [18:45:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P39803 and previous config saved to /var/cache/conftool/dbconfig/20221115-184501-marostegui.json [18:45:49] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:46:38] (03CR) 10Jbond: [C: 03+2] differ: exclude classes from core types [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/857032 (owner: 10Jbond) [18:47:37] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump to 2.5.1 [puppet] - 10https://gerrit.wikimedia.org/r/857033 (owner: 10Jbond) [18:49:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5032.eqsin.wmnet with OS buster [18:49:26] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5032.eqsin.wmnet with OS buster completed: - cp5032 (**WARN**) - Removed from Puppet a... 
[18:50:14] (03PS10) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [18:50:16] (03PS6) 10David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 [18:50:18] (03PS3) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [18:51:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38198/console" [puppet] - 10https://gerrit.wikimedia.org/r/857031 (owner: 10Jbond) [18:51:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38197/console" [puppet] - 10https://gerrit.wikimedia.org/r/824769 (https://phabricator.wikimedia.org/T315676) (owner: 10Isaac Johnson) [18:53:28] (03CR) 10CI reject: [V: 04-1] ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [18:53:30] (03CR) 10CI reject: [V: 04-1] wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [18:53:32] (03CR) 10CI reject: [V: 04-1] wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 (owner: 10David Caro) [18:54:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T321130)', diff saved to https://phabricator.wikimedia.org/P39804 and previous config saved to /var/cache/conftool/dbconfig/20221115-185457-marostegui.json [18:55:02] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:55:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db2103 (T318605)', diff saved to https://phabricator.wikimedia.org/P39805 and previous config saved to /var/cache/conftool/dbconfig/20221115-185558-ladsgroup.json [18:56:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [18:56:02] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [18:56:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [18:56:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T318605)', diff saved to https://phabricator.wikimedia.org/P39806 and previous config saved to /var/cache/conftool/dbconfig/20221115-185619-ladsgroup.json [18:58:45] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp2042.codfw.wmnet with OS bullseye [18:58:53] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp2042.codfw.wmnet with OS bullseye [19:00:05] brennen and jeena: (Dis)respected human, time to deploy MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T1900). Please do the needful. [19:00:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P39807 and previous config saved to /var/cache/conftool/dbconfig/20221115-190008-marostegui.json [19:01:59] o/ [19:04:15] !log train 1.40.0-wmf.10 (T320515) - no current blockers, rolling to group0. 
[19:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:20] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [19:05:14] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857034 (https://phabricator.wikimedia.org/T320515) [19:05:16] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857034 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [19:05:59] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857034 (https://phabricator.wikimedia.org/T320515) (owner: 10TrainBranchBot) [19:10:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P39808 and previous config saved to /var/cache/conftool/dbconfig/20221115-191003-marostegui.json [19:10:29] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.10 refs T320515 [19:10:33] T320515: 1.40.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T320515 [19:13:12] (03PS11) 10David Caro: wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) [19:13:14] (03PS7) 10David Caro: wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 [19:13:16] (03PS4) 10David Caro: ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 [19:15:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39809 and previous config saved to /var/cache/conftool/dbconfig/20221115-191514-marostegui.json [19:15:17] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2148.codfw.wmnet with reason: Maintenance [19:15:20] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [19:15:25] (03PS1) 10JHathaway: aux-k8s: fix affinity for coredns [deployment-charts] - 10https://gerrit.wikimedia.org/r/857035 (https://phabricator.wikimedia.org/T321120) [19:15:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2148.codfw.wmnet with reason: Maintenance [19:15:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T321126)', diff saved to https://phabricator.wikimedia.org/P39810 and previous config saved to /var/cache/conftool/dbconfig/20221115-191536-marostegui.json [19:16:21] (03CR) 10CI reject: [V: 04-1] wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [19:16:25] (03CR) 10CI reject: [V: 04-1] wmcs.ceph.set_cluster_in_maintenance: fix bad parameter [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/855679 (owner: 10David Caro) [19:16:29] (03CR) 10CI reject: [V: 04-1] ceph.roll_restart_*daemons: allow ignoring current health issues [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856933 (owner: 10David Caro) [19:18:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321126)', diff saved to https://phabricator.wikimedia.org/P39811 and previous config saved to /var/cache/conftool/dbconfig/20221115-191818-marostegui.json [19:18:28] (03CR) 10Andrew Bogott: [C: 03+1] wmcs: add socks proxy support to wmcs cookbooks [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852960 (https://phabricator.wikimedia.org/T319426) (owner: 10David Caro) [19:20:52] (03CR) 10JHathaway: [C: 03+2] aux-k8s: fix affinity for coredns [deployment-charts] - 
10https://gerrit.wikimedia.org/r/857035 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [19:25:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P39812 and previous config saved to /var/cache/conftool/dbconfig/20221115-192509-marostegui.json [19:28:16] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [19:28:28] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [19:32:57] !log renumbering overlay vrf loopback interface lsw1-e3-eqiad [19:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P39813 and previous config saved to /var/cache/conftool/dbconfig/20221115-193324-marostegui.json [19:34:40] !log updated pcc to 2.5.1 [19:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:18] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:09] (03CR) 10Bking: [C: 03+2] [elastic,open]search: rip out unnecessary jvm options [puppet] - 10https://gerrit.wikimedia.org/r/838253 (owner: 10Ryan Kemper) [19:40:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T321130)', diff saved to https://phabricator.wikimedia.org/P39814 and previous config saved to /var/cache/conftool/dbconfig/20221115-194016-marostegui.json [19:40:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1198.eqiad.wmnet with reason: Maintenance [19:40:21] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:40:31] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
5:00:00 on db1198.eqiad.wmnet with reason: Maintenance [19:40:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T321130)', diff saved to https://phabricator.wikimedia.org/P39815 and previous config saved to /var/cache/conftool/dbconfig/20221115-194037-marostegui.json [19:42:39] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [19:42:52] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [19:43:48] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:48:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P39816 and previous config saved to /var/cache/conftool/dbconfig/20221115-194830-marostegui.json [19:50:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T321130)', diff saved to https://phabricator.wikimedia.org/P39817 and previous config saved to /var/cache/conftool/dbconfig/20221115-195002-marostegui.json [19:50:07] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:50:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [debs/prometheus-logstash-exporter] - 10https://gerrit.wikimedia.org/r/857049 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [19:51:07] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp2042.codfw.wmnet with OS bullseye [19:51:14] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp2042.codfw.wmnet with OS bullseye executed with errors: - cp2042 (**FAIL**) - Downtimed on Ic... 
[19:51:22] (03PS1) 10Volans: sre.hosts.provision: handle also PowerEdge R650 [cookbooks] - 10https://gerrit.wikimedia.org/r/857038 (https://phabricator.wikimedia.org/T321128) [19:53:28] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) >>! In T322048#8396337, @Papaul wrote: > CP5032 firmware info . This server is ready for OS install. > ` > System BIOS Version = 1.7.5 > Firmware Version = 6.00.30.... [19:54:00] (03CR) 10Bking: [C: 03+2] elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) (owner: 10Bking) [19:54:20] (03PS10) 10Bking: elastic: change java GC options to default for ES7 [puppet] - 10https://gerrit.wikimedia.org/r/838248 (https://phabricator.wikimedia.org/T319020) [19:54:34] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [19:54:36] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [19:54:45] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [19:55:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:55:19] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [19:55:33] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [19:55:37] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [19:55:48] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [19:55:49] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [19:56:15] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [19:56:16] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[19:56:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:59:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48976 bytes in 2.662 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 3.774 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:52] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:02:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:03:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T321126)', diff saved to https://phabricator.wikimedia.org/P39818 and previous config saved to /var/cache/conftool/dbconfig/20221115-200337-marostegui.json [20:03:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [20:03:42] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [20:03:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [20:03:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39819 and previous config saved to /var/cache/conftool/dbconfig/20221115-200358-marostegui.json [20:04:03] (03PS1) 10Stang: tnwiki: Set timezone to Africa/Gaborone (UTC+2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857040 (https://phabricator.wikimedia.org/T318208) [20:05:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P39820 and previous config saved to /var/cache/conftool/dbconfig/20221115-200508-marostegui.json [20:06:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39821 and previous config saved to /var/cache/conftool/dbconfig/20221115-200641-marostegui.json [20:06:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PATCH nodes) on aux-k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:10:40] PROBLEM - Check systemd state on elastic2070 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch_7@production-search-codfw.service,elasticsearch_7@production-search-omega-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:43] (03CR) 10Papaul: [C: 03+1] sre.hosts.provision: handle also PowerEdge R650 [cookbooks] - 10https://gerrit.wikimedia.org/r/857038 (https://phabricator.wikimedia.org/T321128) (owner: 10Volans) [20:11:50] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: handle also PowerEdge R650 [cookbooks] - 10https://gerrit.wikimedia.org/r/857038 (https://phabricator.wikimedia.org/T321128) (owner: 10Volans) [20:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318605)', diff saved to https://phabricator.wikimedia.org/P39822 and previous config saved to /var/cache/conftool/dbconfig/20221115-201348-ladsgroup.json [20:13:53] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:16:37] (03CR) 10Cwhite: [C: 03+2] scap: update logstash_host for beta scap [puppet] - 10https://gerrit.wikimedia.org/r/854109 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [20:17:07] (03Merged) 10jenkins-bot: sre.hosts.provision: handle also PowerEdge R650 [cookbooks] - 10https://gerrit.wikimedia.org/r/857038 (https://phabricator.wikimedia.org/T321128) (owner: 10Volans) [20:20:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P39823 and previous config saved to /var/cache/conftool/dbconfig/20221115-202015-marostegui.json [20:21:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P39824 and previous config saved to 
/var/cache/conftool/dbconfig/20221115-202148-marostegui.json [20:22:42] (03PS1) 10JHathaway: aux-k8s: add pki intermediate for cfssl [puppet] - 10https://gerrit.wikimedia.org/r/857043 (https://phabricator.wikimedia.org/T321120) [20:23:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/857043 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:27:08] PROBLEM - IPMI Sensor Status on cp5032 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical, Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [20:27:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T318605)', diff saved to https://phabricator.wikimedia.org/P39825 and previous config saved to /var/cache/conftool/dbconfig/20221115-202733-ladsgroup.json [20:27:39] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:28:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P39826 and previous config saved to /var/cache/conftool/dbconfig/20221115-202854-ladsgroup.json [20:33:34] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38200/console" [puppet] - 10https://gerrit.wikimedia.org/r/857043 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:34:13] (03CR) 10JHathaway: [V: 03+1 C: 03+2] aux-k8s: add pki intermediate for cfssl [puppet] - 10https://gerrit.wikimedia.org/r/857043 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:35:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T321130)', diff saved to https://phabricator.wikimedia.org/P39827 and previous config saved to /var/cache/conftool/dbconfig/20221115-203521-marostegui.json 
[20:35:23] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:35:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [20:35:27] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:36:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P39828 and previous config saved to /var/cache/conftool/dbconfig/20221115-203654-marostegui.json [20:39:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host dbprov2004.mgmt.codfw.wmnet with reboot policy GRACEFUL [20:42:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P39829 and previous config saved to /var/cache/conftool/dbconfig/20221115-204239-ladsgroup.json [20:44:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P39830 and previous config saved to /var/cache/conftool/dbconfig/20221115-204401-ladsgroup.json [20:44:37] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [20:44:38] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[20:49:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov2004.mgmt.codfw.wmnet with reboot policy GRACEFUL [20:51:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2105.codfw.wmnet with reason: Maintenance [20:52:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321126)', diff saved to https://phabricator.wikimedia.org/P39831 and previous config saved to /var/cache/conftool/dbconfig/20221115-205201-marostegui.json [20:52:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2175.codfw.wmnet with reason: Maintenance [20:52:05] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [20:52:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2105.codfw.wmnet with reason: Maintenance [20:52:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T321130)', diff saved to https://phabricator.wikimedia.org/P39832 and previous config saved to /var/cache/conftool/dbconfig/20221115-205214-marostegui.json [20:52:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2175.codfw.wmnet with reason: Maintenance [20:52:19] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:52:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T321126)', diff saved to https://phabricator.wikimedia.org/P39833 and previous config saved to /var/cache/conftool/dbconfig/20221115-205222-marostegui.json [20:55:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321126)', diff saved to https://phabricator.wikimedia.org/P39834 and previous config saved to /var/cache/conftool/dbconfig/20221115-205503-marostegui.json [20:55:46] (03PS1) 10JHathaway: aux-k8s: add 
deployment service [puppet] - 10https://gerrit.wikimedia.org/r/857045 (https://phabricator.wikimedia.org/T321120) [20:57:05] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38201/console" [puppet] - 10https://gerrit.wikimedia.org/r/857045 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:57:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P39835 and previous config saved to /var/cache/conftool/dbconfig/20221115-205746-ladsgroup.json [20:58:45] (03PS1) 10Ebernhardson: elasticsearch jvm.options: Only emit NewRatio when set [puppet] - 10https://gerrit.wikimedia.org/r/857066 [20:59:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T318605)', diff saved to https://phabricator.wikimedia.org/P39836 and previous config saved to /var/cache/conftool/dbconfig/20221115-205907-ladsgroup.json [20:59:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance [20:59:12] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [20:59:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance [20:59:29] (03CR) 10JHathaway: [V: 03+1 C: 03+2] aux-k8s: add deployment service [puppet] - 10https://gerrit.wikimedia.org/r/857045 (https://phabricator.wikimedia.org/T321120) (owner: 10JHathaway) [20:59:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T318605)', diff saved to https://phabricator.wikimedia.org/P39837 and previous config saved to /var/cache/conftool/dbconfig/20221115-205929-ladsgroup.json [20:59:35] (03CR) 10CI reject: [V: 04-1] elasticsearch jvm.options: Only emit NewRatio when set [puppet] - 10https://gerrit.wikimedia.org/r/857066 
(owner: 10Ebernhardson) [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221115T2100). [21:00:05] cjming and cirno: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:21] i can deploy! [21:00:35] (03PS1) 10Effie Mouzeli: maps: add support for replication slots [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) [21:00:39] starting with my patch [21:00:46] Thanks cjming :) [21:01:13] (03CR) 10CI reject: [V: 04-1] maps: add support for replication slots [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [21:01:21] (03PS5) 10Clare Ming: EditAttemptStep sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [21:01:29] o/ [21:01:54] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [21:01:56] (03PS2) 10Ebernhardson: elasticsearch jvm.options: Only emit NewRatio when set [puppet] - 10https://gerrit.wikimedia.org/r/857066 [21:02:00] hi cirno: i'll do your patches here soon [21:02:20] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[21:02:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [21:03:27] (03Merged) 10jenkins-bot: EditAttemptStep sampling rate to 1 everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/854570 (https://phabricator.wikimedia.org/T312016) (owner: 10Phuedx) [21:03:54] !log cjming@deploy1002 Started scap: Backport for [[gerrit:854570|EditAttemptStep sampling rate to 1 everywhere (T312016)]] [21:03:59] T312016: Increase EditAttemptStep sampling rate(s) to 100% - https://phabricator.wikimedia.org/T312016 [21:04:16] (03CR) 10Bking: [C: 03+2] elasticsearch jvm.options: Only emit NewRatio when set [puppet] - 10https://gerrit.wikimedia.org/r/857066 (owner: 10Ebernhardson) [21:04:18] !log cjming@deploy1002 cjming and phuedx: Backport for [[gerrit:854570|EditAttemptStep sampling rate to 1 everywhere (T312016)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:04:43] (03PS3) 10Herron: dispatch: add apache redirect from default org to wikimedia org [puppet] - 10https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) [21:05:14] (03PS4) 10Herron: dispatch: add apache redirect from default org to wikimedia org [puppet] - 10https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) [21:05:44] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10Papaul) [21:06:22] (03PS1) 10Cathal Mooney: Change sflow template to support updated loopback int name on evpn sw [homer/public] - 10https://gerrit.wikimedia.org/r/857068 (https://phabricator.wikimedia.org/T312635) [21:06:40] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 
(10Papaul) The serial communication issue we had was fixed by @Volans patch [21:07:13] (03PS2) 10Effie Mouzeli: maps: add support for replication slots [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) [21:07:48] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [21:07:50] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [21:07:51] (03CR) 10Cathal Mooney: [C: 03+2] Change sflow template to support updated loopback int name on evpn sw [homer/public] - 10https://gerrit.wikimedia.org/r/857068 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [21:08:25] (03Merged) 10jenkins-bot: Change sflow template to support updated loopback int name on evpn sw [homer/public] - 10https://gerrit.wikimedia.org/r/857068 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [21:08:40] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:854570|EditAttemptStep sampling rate to 1 everywhere (T312016)]] (duration: 04m 45s) [21:09:13] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [21:09:14] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [21:10:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P39838 and previous config saved to /var/cache/conftool/dbconfig/20221115-211009-marostegui.json [21:10:40] cirno: (aka koi?) your 1st patch is doing a lot -- you marked it as noop - is there anything to test other than making sure errors don't blow up?
[21:12:03] I tested locally and confirmed that this version of the python script does exactly the same as the previous version, so I think it's fine [21:12:24] alrighty - moving ahead then with 857030 [21:12:32] (for testing I mean the commands "generate" and "update", on a few sites) [21:12:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T318605)', diff saved to https://phabricator.wikimedia.org/P39839 and previous config saved to /var/cache/conftool/dbconfig/20221115-211253-ladsgroup.json [21:12:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [21:12:58] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [21:13:00] is there anything that needs to happen once it's on prod? [21:13:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [21:13:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T318605)', diff saved to https://phabricator.wikimedia.org/P39840 and previous config saved to /var/cache/conftool/dbconfig/20221115-211314-ladsgroup.json [21:13:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857030 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [21:13:37] this script doesn't run on prod - it runs on a local machine [21:14:07] so actually no need to scap :) [21:14:15] (03Merged) 10jenkins-bot: logos: Remove duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857030 (https://phabricator.wikimedia.org/T307705) (owner: 10Stang) [21:14:40] !log cjming@deploy1002 Started scap: Backport for [[gerrit:857030|logos: Remove duplicated code (T307705)]] [21:14:44] T307705: Extend mw-config's logos management system to also cover wordmarks (wmgSiteLogoWordmark) - 
https://phabricator.wikimedia.org/T307705 [21:14:49] gtk - so i'll just go ahead and sync when prompted [21:15:04] !log cjming@deploy1002 cjming and stang: Backport for [[gerrit:857030|logos: Remove duplicated code (T307705)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:15:15] (03CR) 10Herron: dispatch: add apache redirect from default org to wikimedia org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) (owner: 10Herron) [21:17:56] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [21:19:11] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:857030|logos: Remove duplicated code (T307705)]] (duration: 04m 31s) [21:19:22] (03PS2) 10Clare Ming: tnwiki: Set timezone to Africa/Gaborone (UTC+2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857040 (https://phabricator.wikimedia.org/T318208) (owner: 10Stang) [21:19:56] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [21:20:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857040 (https://phabricator.wikimedia.org/T318208) (owner: 10Stang) [21:21:02] (03Merged) 10jenkins-bot: tnwiki: Set timezone to Africa/Gaborone (UTC+2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/857040 (https://phabricator.wikimedia.org/T318208) (owner: 10Stang) [21:21:25] !log cjming@deploy1002 Started scap: Backport for [[gerrit:857040|tnwiki: Set timezone to Africa/Gaborone (UTC+2) (T318208)]] [21:21:30] T318208: Change the auto time on the Setswana Wikipedia to (GMT+2) - https://phabricator.wikimedia.org/T318208 [21:21:49] !log cjming@deploy1002 cjming and stang: Backport for [[gerrit:857040|tnwiki: Set timezone to Africa/Gaborone (UTC+2) (T318208)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:22:38] cirno: 2nd patch on mwdebug [21:23:15] cjming: tested under incognito mode and the default time zone is set to UTC+2 as expected, so LGTM [21:23:25] cool - syncing [21:23:27] (03PS1) 10Cathal Mooney: Change hard-coded loopback int references in EVPN templates [homer/public] - 10https://gerrit.wikimedia.org/r/857069 (https://phabricator.wikimedia.org/T312635) [21:23:50] (03PS1) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [21:23:52] RECOVERY - Check systemd state on elastic2070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:24:25] (03CR) 10CI reject: [V: 04-1] prometheus: 
Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [21:25:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P39841 and previous config saved to /var/cache/conftool/dbconfig/20221115-212516-marostegui.json [21:25:38] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [21:27:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T321130)', diff saved to https://phabricator.wikimedia.org/P39842 and previous config saved to /var/cache/conftool/dbconfig/20221115-212706-marostegui.json [21:27:11] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:27:14] (03CR) 10Cathal Mooney: [C: 03+2] Change hard-coded loopback int references in EVPN templates [homer/public] - 10https://gerrit.wikimedia.org/r/857069 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [21:27:40] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:857040|tnwiki: Set timezone to Africa/Gaborone (UTC+2) (T318208)]] (duration: 06m 14s) [21:27:44] T318208: Change the auto time on the Setswana Wikipedia to (GMT+2) - https://phabricator.wikimedia.org/T318208 [21:27:48] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:28:08] (03Merged) 10jenkins-bot: Change hard-coded loopback int references in EVPN templates [homer/public] - 10https://gerrit.wikimedia.org/r/857069 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [21:28:10] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Change hard-coded loopback int references in EVPN templates [homer/public] - 10https://gerrit.wikimedia.org/r/857069 (https://phabricator.wikimedia.org/T312635) (owner: 10Cathal Mooney) [21:28:19] cirno: both your patches should be live [21:28:33] thanks! 
[21:28:43] your welcome :) [21:28:52] *you're [21:31:11] (03PS2) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [21:33:43] !log end of UTC late backport window [21:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:56] (03CR) 10CI reject: [V: 04-1] prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [21:34:15] (MjolnirUpdateFailureRateExceedesThreshold) firing: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [21:37:47] (03PS1) 10Andrew Bogott: Patch cinder volume_type api to allow non-uuid project ids. [puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) [21:38:16] (03PS2) 10Andrew Bogott: Patch cinder volume_type api to allow non-uuid project ids. 
[puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) [21:39:15] (MjolnirUpdateFailureRateExceedesThreshold) resolved: Data shipping to CirrusSearch in eqiad is experiencing abnormal failure rates - TODO - https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates - https://alerts.wikimedia.org/?q=alertname%3DMjolnirUpdateFailureRateExceedesThreshold [21:40:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T321126)', diff saved to https://phabricator.wikimedia.org/P39843 and previous config saved to /var/cache/conftool/dbconfig/20221115-214022-marostegui.json [21:40:28] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [21:41:58] (03PS3) 10Andrew Bogott: Patch cinder volume_type api to allow non-uuid project ids. [puppet] - 10https://gerrit.wikimedia.org/r/857073 (https://phabricator.wikimedia.org/T301949) [21:42:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P39844 and previous config saved to /var/cache/conftool/dbconfig/20221115-214212-marostegui.json [21:42:16] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/857067/38202/" [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [21:51:32] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:57:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P39845 and previous config saved to /var/cache/conftool/dbconfig/20221115-215719-marostegui.json [22:01:06] (03PS3) 10Effie Mouzeli: maps: add support for replication slots [puppet] - 
10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) [22:06:28] 10SRE-Access-Requests: Jan Dittrich/Simulo asking for NDA for data access - https://phabricator.wikimedia.org/T317501 (10Stang) [22:07:05] (03PS1) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [22:08:15] 10SRE-Access-Requests: Jan Dittrich/Simulo asking for NDA for data access - https://phabricator.wikimedia.org/T317501 (10Stang) [22:11:35] (03PS4) 10Effie Mouzeli: maps: add support for replication slots [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) [22:12:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T321130)', diff saved to https://phabricator.wikimedia.org/P39846 and previous config saved to /var/cache/conftool/dbconfig/20221115-221225-marostegui.json [22:12:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2109.codfw.wmnet with reason: Maintenance [22:12:31] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:12:38] (03CR) 10Dzahn: [C: 04-1] "I can't get this to rebase properly for some reason. Either it has conflicts or it claims no changes to upload." 
[puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:12:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2109.codfw.wmnet with reason: Maintenance [22:12:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T321130)', diff saved to https://phabricator.wikimedia.org/P39847 and previous config saved to /var/cache/conftool/dbconfig/20221115-221247-marostegui.json [22:13:38] (03PS2) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [22:18:14] (03PS5) 10Effie Mouzeli: maps: add support for replication slots [puppet] - 10https://gerrit.wikimedia.org/r/857067 (https://phabricator.wikimedia.org/T290149) [22:18:53] (03PS3) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [22:19:13] (03PS4) 10Effie Mouzeli: maps: enable replication slots on maps1009 and maps1008 [puppet] - 10https://gerrit.wikimedia.org/r/857077 (https://phabricator.wikimedia.org/T290149) [22:24:33] (03PS2) 10Dzahn: dumps/phabricator: switch phab dumps host from phab1001 to phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) [22:27:01] (03PS1) 10Tim Starling: Feed: Use DerivativeContext and not clone main RequestContext [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/856582 (https://phabricator.wikimedia.org/T323153) [22:30:54] (03CR) 10Dzahn: "dumps file exists on phab1004 but it is NOT updated still" [puppet] - 10https://gerrit.wikimedia.org/r/824805 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:36:15] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/857077/38213/" [puppet] - 10https://gerrit.wikimedia.org/r/857077 
(https://phabricator.wikimedia.org/T290149) (owner: 10Effie Mouzeli) [22:39:01] (03PS1) 10Dzahn: phabricator: enable dumping on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) [22:40:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T318605)', diff saved to https://phabricator.wikimedia.org/P39848 and previous config saved to /var/cache/conftool/dbconfig/20221115-224011-ladsgroup.json [22:40:17] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [22:42:54] (03PS1) 10Dzahn: phabricator: use systemd::sysuser for phd user, also on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/857081 (https://phabricator.wikimedia.org/T280597) [22:44:06] (03CR) 10Dzahn: "BAD vs GOOD:" [puppet] - 10https://gerrit.wikimedia.org/r/857081 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:45:29] (03CR) 10Dzahn: [C: 03+2] phabricator: use systemd::sysuser for phd user, also on phab1004 [puppet] - 10https://gerrit.wikimedia.org/r/857081 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:47:24] (03CR) 10Dzahn: [C: 03+2] "the user already existed with the UID/GID 920/920, so what this did was created /etc/sysusers.d/phd.conf and Systemd/Exec[Refresh sysusers" [puppet] - 10https://gerrit.wikimedia.org/r/857081 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:47:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T321130)', diff saved to https://phabricator.wikimedia.org/P39849 and previous config saved to /var/cache/conftool/dbconfig/20221115-224733-marostegui.json [22:47:39] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:48:06] (03CR) 10Dzahn: [C: 03+2] phabricator: use systemd::sysuser for phd user, also on phab1004 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/857081 
(https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:51:20] !log phab1004 - running public_task_dump.py [22:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P39850 and previous config saved to /var/cache/conftool/dbconfig/20221115-225518-ladsgroup.json [22:55:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T318605)', diff saved to https://phabricator.wikimedia.org/P39851 and previous config saved to /var/cache/conftool/dbconfig/20221115-225537-ladsgroup.json [22:55:42] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:01:59] jouncebot: nowandnext [23:01:59] No deployments scheduled for the next 8 hour(s) and 58 minute(s) [23:01:59] In 8 hour(s) and 58 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221116T0800) [23:02:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P39852 and previous config saved to /var/cache/conftool/dbconfig/20221115-230240-marostegui.json [23:04:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/856582 (https://phabricator.wikimedia.org/T323153) (owner: 10Tim Starling) [23:05:58] (03CR) 10Dzahn: "currently running the dump script manually fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/857079 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [23:06:22] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:06:54] PROBLEM - 
Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P39853 and previous config saved to /var/cache/conftool/dbconfig/20221115-231025-ladsgroup.json [23:10:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P39854 and previous config saved to /var/cache/conftool/dbconfig/20221115-231043-ladsgroup.json [23:16:38] * Krinkle debugging on mwdebug1001 [23:17:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P39855 and previous config saved to /var/cache/conftool/dbconfig/20221115-231746-marostegui.json [23:18:57] (03Merged) 10jenkins-bot: Feed: Use DerivativeContext and not clone main RequestContext [core] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/856582 (https://phabricator.wikimedia.org/T323153) (owner: 10Tim Starling) [23:18:59] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@7762e35]: import_cirrus_indexes: snapshot partitioning should not use dashes [23:19:24] !log brennen@deploy1002 Started scap: Backport for [[gerrit:856582|Feed: Use DerivativeContext and not clone main RequestContext (T323153)]] [23:19:28] T323153: PHP Notice: Unexpected clearActionName after getActionName already called - https://phabricator.wikimedia.org/T323153 [23:19:50] !log brennen@deploy1002 brennen and tstarling: Backport for [[gerrit:856582|Feed: Use DerivativeContext and not clone main RequestContext (T323153)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [23:21:16] !log 
ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@7762e35]: import_cirrus_indexes: snapshot partitioning should not use dashes (duration: 02m 16s) [23:25:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T318605)', diff saved to https://phabricator.wikimedia.org/P39856 and previous config saved to /var/cache/conftool/dbconfig/20221115-232532-ladsgroup.json [23:25:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [23:25:37] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:25:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [23:25:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P39857 and previous config saved to /var/cache/conftool/dbconfig/20221115-232550-ladsgroup.json [23:25:50] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:856582|Feed: Use DerivativeContext and not clone main RequestContext (T323153)]] (duration: 06m 26s) [23:25:57] T323153: PHP Notice: Unexpected clearActionName after getActionName already called - https://phabricator.wikimedia.org/T323153 [23:26:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T318605)', diff saved to https://phabricator.wikimedia.org/P39858 and previous config saved to /var/cache/conftool/dbconfig/20221115-232600-ladsgroup.json [23:26:31] (03PS3) 10BCornwall: prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) [23:29:15] (03CR) 10CI reject: [V: 04-1] prometheus: Refactor ATS config monitoring [puppet] - 10https://gerrit.wikimedia.org/r/857070 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [23:32:53] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T321130)', diff saved to https://phabricator.wikimedia.org/P39859 and previous config saved to /var/cache/conftool/dbconfig/20221115-233253-marostegui.json [23:32:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2139.codfw.wmnet with reason: Maintenance [23:32:58] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:33:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2139.codfw.wmnet with reason: Maintenance [23:40:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T318605)', diff saved to https://phabricator.wikimedia.org/P39860 and previous config saved to /var/cache/conftool/dbconfig/20221115-234056-ladsgroup.json [23:40:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [23:41:02] T318605: Deploy new externallinks fields to production - https://phabricator.wikimedia.org/T318605 [23:41:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [23:49:57] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10Andrew) [23:51:01] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T319217 (10Andrew) 05Open→03Resolved No longer. ` andrew@cumin1001:~$ sudo cumin K{labstore*} "hostname -f" 4 hosts will... 
[23:54:12] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudmetrics[1001-1002].eqiad.wmnet [23:55:58] (03PS1) 10Andrew Bogott: Remove last trace of cloudmetrics100[12] [puppet] - 10https://gerrit.wikimedia.org/r/857087 (https://phabricator.wikimedia.org/T297444) [23:58:50] (03CR) 10Andrew Bogott: [C: 03+2] Remove last trace of cloudmetrics100[12] [puppet] - 10https://gerrit.wikimedia.org/r/857087 (https://phabricator.wikimedia.org/T297444) (owner: 10Andrew Bogott)