[00:02:43] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:03:10] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:05:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T376905)', diff saved to https://phabricator.wikimedia.org/P69709 and previous config saved to /var/cache/conftool/dbconfig/20241014-000520-ladsgroup.json [00:06:55] FIRING: [4x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:43] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:07:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:07:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:08:40] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1079676 (owner: TrainBranchBot) [00:09:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running 
calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:09:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224358 (phaultfinder) [00:11:27] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:11:55] FIRING: [5x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:12:43] RESOLVED: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:12:57] RESOLVED: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:15:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [00:16:27] RESOLVED: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:16:55] RESOLVED: [5x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:17:46] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:19:27] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:20:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P69710 and previous config saved to /var/cache/conftool/dbconfig/20241014-002027-ladsgroup.json [00:20:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:21:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [00:24:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:25:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:26:55] FIRING: [6x] SystemdUnitFailed: imagecatalog_record.service on 
deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:29:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:30:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:31:55] RESOLVED: [6x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:34:09] RESOLVED: [6x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:34:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:35:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P69711 and previous config saved to /var/cache/conftool/dbconfig/20241014-003534-ladsgroup.json [00:36:55] FIRING: [6x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:41:40] FIRING: KubernetesRsyslogDown: rsyslog on 
kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:41:55] RESOLVED: [5x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:44:12] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:44:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224362 (phaultfinder) [00:45:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:46:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:49:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:50:28] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:50:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1185 (T376905)', diff saved to https://phabricator.wikimedia.org/P69712 and previous config saved to /var/cache/conftool/dbconfig/20241014-005042-ladsgroup.json [00:50:47] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [00:50:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [00:50:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T376905)', diff saved to https://phabricator.wikimedia.org/P69713 and previous config saved to /var/cache/conftool/dbconfig/20241014-005056-ladsgroup.json [00:51:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [00:54:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [00:55:43] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [00:58:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T376905)', diff saved to https://phabricator.wikimedia.org/P69714 and previous config saved to /var/cache/conftool/dbconfig/20241014-005849-ladsgroup.json [00:59:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:00:43] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - 
https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [01:01:55] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:04:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [01:06:55] RESOLVED: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:09:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:10:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [01:10:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [01:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:13:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P69715 and previous config saved to /var/cache/conftool/dbconfig/20241014-011356-ladsgroup.json [01:14:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:15:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [01:17:54] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:22:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:24:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:25:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [01:27:40] RESOLVED: KubernetesRsyslogDown: rsyslog on 
kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:27:54] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:29:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P69716 and previous config saved to /var/cache/conftool/dbconfig/20241014-012903-ladsgroup.json [01:29:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:30:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [01:33:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:35:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [01:38:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:39:35] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224439 (phaultfinder) [01:44:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T376905)', diff saved to https://phabricator.wikimedia.org/P69717 and previous config saved to /var/cache/conftool/dbconfig/20241014-014410-ladsgroup.json [01:44:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [01:44:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [01:44:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T376905)', diff saved to https://phabricator.wikimedia.org/P69718 and previous config saved to /var/cache/conftool/dbconfig/20241014-014435-ladsgroup.json [01:45:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [01:47:54] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:50:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [01:50:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T376905)', diff saved to https://phabricator.wikimedia.org/P69719 and previous config saved to 
/var/cache/conftool/dbconfig/20241014-015030-ladsgroup.json [01:52:54] FIRING: [4x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:54:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:54:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [01:59:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:04:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:05:12] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [02:05:28] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:05:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to 
https://phabricator.wikimedia.org/P69720 and previous config saved to /var/cache/conftool/dbconfig/20241014-020537-ladsgroup.json [02:06:27] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:07:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:07:54] FIRING: [4x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:11:27] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:11:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:12:54] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:12:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [02:14:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:16:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:17:54] RESOLVED: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:19:38] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224490 (phaultfinder) [02:20:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P69721 and previous config saved to /var/cache/conftool/dbconfig/20241014-022044-ladsgroup.json [02:26:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:29:58] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:31:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:32:46] FIRING: [8x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:34:57] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:35:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T376905)', diff saved to https://phabricator.wikimedia.org/P69722 and previous config saved to /var/cache/conftool/dbconfig/20241014-023551-ladsgroup.json [02:35:56] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [02:36:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance [02:36:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T376905)', diff saved to https://phabricator.wikimedia.org/P69723 and previous config saved to /var/cache/conftool/dbconfig/20241014-023616-ladsgroup.json [02:36:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:36:58] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:37:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:12] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:40:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:41:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T376905)', diff saved to https://phabricator.wikimedia.org/P69724 and previous config saved to /var/cache/conftool/dbconfig/20241014-024149-ladsgroup.json [02:42:46] FIRING: [7x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:42:53] FIRING: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [02:43:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:44:24] FIRING: [3x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:46:25] (PS1) Tim Starling: Enable {{USERLANGUAGE}} on Commons and Meta [mediawiki-config] - 
https://gerrit.wikimedia.org/r/1079680 (https://phabricator.wikimedia.org/T4085) [02:46:27] FIRING: [7x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:48:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [02:49:27] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:50:58] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [02:51:27] FIRING: [7x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:52:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:52:53] RESOLVED: DDoSDetected: FastNetMon has detected an attack on eqsin #page - https://bit.ly/wmf-fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DDDoSDetected [02:52:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [02:54:12] FIRING: [3x] KubernetesCalicoDown: 
kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [02:54:24] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:56:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P69725 and previous config saved to /var/cache/conftool/dbconfig/20241014-025656-ladsgroup.json [02:59:24] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:02:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:06:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:09:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224496 (phaultfinder) [03:12:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P69726 and previous config saved to 
/var/cache/conftool/dbconfig/20241014-031203-ladsgroup.json [03:27:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T376905)', diff saved to https://phabricator.wikimedia.org/P69727 and previous config saved to /var/cache/conftool/dbconfig/20241014-032710-ladsgroup.json [03:27:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [03:27:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [03:32:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [03:32:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1230.eqiad.wmnet with reason: Maintenance [03:32:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1230 (T376905)', diff saved to https://phabricator.wikimedia.org/P69728 and previous config saved to /var/cache/conftool/dbconfig/20241014-033237-ladsgroup.json [03:39:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T376905)', diff saved to https://phabricator.wikimedia.org/P69729 and previous config saved to /var/cache/conftool/dbconfig/20241014-033922-ladsgroup.json [03:44:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224532 (phaultfinder) [03:46:27] RESOLVED: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:46:59] (CR) Tim Starling: [C:+1] Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] -
https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: Pppery) [03:49:12] RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [03:51:12] (CR) Bugreporter: "I think we should add it to CommonSetting instead (https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/7cd0d3710ced29a9cf9c1632ed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1079680 (https://phabricator.wikimedia.org/T4085) (owner: Tim Starling) [03:54:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P69730 and previous config saved to /var/cache/conftool/dbconfig/20241014-035429-ladsgroup.json [04:01:09] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:03:27] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: Pppery) [04:09:24] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:09:36] !log
ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P69731 and previous config saved to /var/cache/conftool/dbconfig/20241014-040936-ladsgroup.json [04:11:27] FIRING: [2x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2003:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:19:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:19:24] RESOLVED: [3x] SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:21:27] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:22:57] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:23:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - 
https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [04:23:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [04:24:09] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:24:36] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224538 (phaultfinder) [04:24:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T376905)', diff saved to https://phabricator.wikimedia.org/P69732 and previous config saved to /var/cache/conftool/dbconfig/20241014-042443-ladsgroup.json [04:24:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [04:24:54] (PS1) KartikMistry: Update MinT to 2024-10-11-113932-production [deployment-charts] - https://gerrit.wikimedia.org/r/1079682 (https://phabricator.wikimedia.org/T368521) [04:25:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [04:27:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:28:28] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [04:30:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [04:30:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [04:30:54] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:32:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:34:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [04:34:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:35:37] (CR) Tim Starling: "What is the source for this list?"
[mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) (owner: Pppery) [04:35:54] RESOLVED: [3x] SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:37:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:39:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [04:39:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:39:54] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:41:09] FIRING: [4x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:42:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod -
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:43:10] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:44:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [04:46:09] RESOLVED: [4x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:48:10] RESOLVED: [2x] KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:54:55] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:59:10] FIRING: [2x] KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:59:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [04:59:54] RESOLVED: 
SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:02:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:04:49] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224547 (phaultfinder) [05:09:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [05:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:14:24] FIRING: SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:11] (CR) Pppery: "I started with lists like https://meta.wikimedia.org/w/index.php?title=Wikisource#Wikisources_in_Wikipedia, https://meta.wikimedia.org/wik" [mediawiki-config] - https://gerrit.wikimedia.org/r/1079054 (https://phabricator.wikimedia.org/T376923) (owner: Pppery) [05:19:24] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:29:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [05:32:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:34:10] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:34:28] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [05:37:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:39:24] FIRING: [6x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:41:27] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:42:46] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:42:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:44:24] FIRING: [7x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:52:46] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:52:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:54:09] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (GET leases) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:57:46] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:59:09] RESOLVED: 
[4x] KubernetesAPILatency: High Kubernetes API latency (GET leases) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:01:57] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:04:24] FIRING: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:06:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (GET endpoints) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:06:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:08:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [06:09:24] FIRING: [4x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:10:28] FIRING: 
KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:11:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:13:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [06:14:24] FIRING: [4x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:16:09] FIRING: [9x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:20:28] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:20:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [06:20:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2147.codfw.wmnet with reason: Maintenance [06:21:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [06:21:09] RESOLVED: [9x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:21:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2155.codfw.wmnet with reason: Maintenance [06:21:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [06:21:27] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:21:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2187.codfw.wmnet with reason: Maintenance [06:21:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P69733 and previous config saved to /var/cache/conftool/dbconfig/20241014-062135-arnaudb.json [06:21:39] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [06:21:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:22:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [06:22:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1169.eqiad.wmnet with reason: Maintenance [06:22:46] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:22:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1169 
(T367781)', diff saved to https://phabricator.wikimedia.org/P69734 and previous config saved to /var/cache/conftool/dbconfig/20241014-062249-arnaudb.json [06:24:24] RESOLVED: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:24:40] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10224612 (phaultfinder) [06:25:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T367781)', diff saved to https://phabricator.wikimedia.org/P69735 and previous config saved to /var/cache/conftool/dbconfig/20241014-062505-arnaudb.json [06:26:27] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:26:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:31:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:35:09] FIRING: [4x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:35:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO -
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:36:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:37:24] FIRING: SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:40:09] RESOLVED: [4x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:40:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P69736 and previous config saved to /var/cache/conftool/dbconfig/20241014-064012-arnaudb.json [06:40:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:41:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [06:43:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:45:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:48:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:53:46] SRE, serviceops: host rdb1014 is down - https://phabricator.wikimedia.org/T376961#10224623 (LSobanski) [06:55:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P69737 and previous config saved to /var/cache/conftool/dbconfig/20241014-065519-arnaudb.json [06:55:21] SRE: host elastic1064 is down - https://phabricator.wikimedia.org/T376960#10224628 (LSobanski) Open→Resolved [06:55:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [06:56:47] SRE: host elastic1064 is down - https://phabricator.wikimedia.org/T376960#10224626 (LSobanski) Host is up now, see {https://phabricator.wikimedia.org/T376881} [06:57:25] FIRING: [2x] SystemdUnitFailed: kube-publish-sa-cert.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:00:05] Amir1 and Urbanecm: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:00:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [07:02:24] FIRING: [4x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:02:58] no patch scheduled [07:03:18] excellent [07:05:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [07:10:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T367781)', diff saved to https://phabricator.wikimedia.org/P69738 and previous config saved to /var/cache/conftool/dbconfig/20241014-071026-arnaudb.json [07:10:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance [07:10:30] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [07:10:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance [07:10:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1184 (T367781)', diff saved to https://phabricator.wikimedia.org/P69739 and previous config saved to /var/cache/conftool/dbconfig/20241014-071048-arnaudb.json [07:13:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T367781)', diff saved to https://phabricator.wikimedia.org/P69740 and previous config saved to /var/cache/conftool/dbconfig/20241014-071302-arnaudb.json [07:22:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P69741 and previous config saved to /var/cache/conftool/dbconfig/20241014-072201-arnaudb.json [07:22:05] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [07:22:24] FIRING: [4x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:23:44] (03CR) 10Giuseppe Lavagetto: [C:03+2] acme_chief: add SAN for requestctl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1078984 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [07:27:24] FIRING: [4x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:28:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P69742 and previous config saved to /var/cache/conftool/dbconfig/20241014-072810-arnaudb.json [07:28:51] (03PS2) 10Brouberol: Import ceph-csi-cephfs chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077872 (https://phabricator.wikimedia.org/T376406) [07:28:51] (03PS3) 10Brouberol: Make it possible to deploy provisioner without the snahshotter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077873 (https://phabricator.wikimedia.org/T376406) [07:28:51] (03PS3) 10Brouberol: Run the driver-registrar as root [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077874 (https://phabricator.wikimedia.org/T376406) [07:28:52] (03PS3) 10Brouberol: Disable the priviledged security context of the liveness-prometheus container 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1077875 (https://phabricator.wikimedia.org/T376406) [07:28:52] (03PS3) 10Brouberol: Make it possible to create several storage classes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078387 (https://phabricator.wikimedia.org/T376406) [07:28:54] (03PS6) 10Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) [07:28:58] (03CR) 10Brouberol: "Sure!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077875 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [07:30:24] (03PS1) 10Michael Große: refactor(HomepageHooks): extract method for simpler modifyability [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079894 [07:32:25] FIRING: [4x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:37:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P69743 and previous config saved to /var/cache/conftool/dbconfig/20241014-073707-arnaudb.json [07:38:06] (03PS1) 10Brouberol: ceph.backup.s3_local: fix typo in systemd timer command [puppet] - 10https://gerrit.wikimedia.org/r/1079922 (https://phabricator.wikimedia.org/T377104) [07:39:07] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4295/co" [puppet] - 10https://gerrit.wikimedia.org/r/1079922 (https://phabricator.wikimedia.org/T377104) (owner: 10Brouberol) [07:39:25] (03PS2) 10Michael Große: refactor(HomepageHooks): extract method for simpler modifyability [extensions/GrowthExperiments] 
(wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079894 [07:41:02] (03Abandoned) 10Ayounsi: Monitoring rename pfw3-codfw to pfw1 add new fasw [puppet] - 10https://gerrit.wikimedia.org/r/1079216 (https://phabricator.wikimedia.org/T374176) (owner: 10Ayounsi) [07:43:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P69744 and previous config saved to /var/cache/conftool/dbconfig/20241014-074317-arnaudb.json [07:45:20] (03PS2) 10Ayounsi: Prefix validator: ensure k8s role and site [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1079525 (https://phabricator.wikimedia.org/T354169) [07:46:05] (03PS1) 10Brouberol: cloudnative-pg-cluster: Simplify the s3 bucket name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079926 [07:47:25] FIRING: [4x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:50:07] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: Simplify the s3 bucket name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079926 (owner: 10Brouberol) [07:52:06] (03CR) 10Elukey: "Removing the +1 since I'd like to get more into the clusterrolebinding pattern, I had a chat with Janis and there may be a solution to mak" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077872 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [07:52:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P69745 and previous config saved to /var/cache/conftool/dbconfig/20241014-075214-arnaudb.json [07:52:25] FIRING: [4x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:52:28] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:52:36] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [07:55:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [07:57:25] FIRING: [4x] SystemdUnitFailed: kube-apiserver-safe-restart.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:58:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T367781)', diff saved to https://phabricator.wikimedia.org/P69746 and previous config saved to /var/cache/conftool/dbconfig/20241014-075823-arnaudb.json [07:58:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1186.eqiad.wmnet with reason: Maintenance [07:58:27] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [07:58:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1186.eqiad.wmnet with reason: Maintenance [07:58:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1186 (T367781)', diff saved to https://phabricator.wikimedia.org/P69747 and previous config saved to /var/cache/conftool/dbconfig/20241014-075845-arnaudb.json [08:00:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [08:00:28] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster2005.codfw.wmnet [08:00:52] !log jayme@cumin1002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM kubestagemaster2005.codfw.wmnet [08:01:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T367781)', diff saved to https://phabricator.wikimedia.org/P69748 and previous config saved to /var/cache/conftool/dbconfig/20241014-080059-arnaudb.json [08:01:11] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster2005.codfw.wmnet [08:02:21] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster2004.codfw.wmnet [08:05:28] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [08:07:02] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster2005.codfw.wmnet [08:07:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2155 (T367781)', diff saved to https://phabricator.wikimedia.org/P69749 and previous config saved to /var/cache/conftool/dbconfig/20241014-080721-arnaudb.json [08:07:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:07:25] RESOLVED: [3x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:25] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [08:07:37] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 4:00:00 on db2172.codfw.wmnet with reason: Maintenance [08:07:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P69750 and previous config saved to /var/cache/conftool/dbconfig/20241014-080744-arnaudb.json [08:08:05] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster2003.codfw.wmnet [08:09:35] (03PS3) 10Ayounsi: Prefix validator: ensure k8s role and site [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1079525 (https://phabricator.wikimedia.org/T354169) [08:10:37] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster2004.codfw.wmnet [08:10:53] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:11:17] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:11:27] RESOLVED: [2x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:11:53] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:12:07] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:12:52] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:13:55] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster2003.codfw.wmnet [08:13:57] 
RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [08:14:42] RESOLVED: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:16:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P69751 and previous config saved to /var/cache/conftool/dbconfig/20241014-081606-arnaudb.json [08:16:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART [08:16:26] (03PS1) 10Cathal Mooney: Add "no_smallnet" term to BGP6_outfilter policy map on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1079929 [08:21:30] (03CR) 10Volans: "The approach looks good, it would be nice to add a test for it. Couple of suggestions inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: 10Arnaudb) [08:22:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:22:41] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[software/netbox-extras] - https://gerrit.wikimedia.org/r/1079525 (https://phabricator.wikimedia.org/T354169) (owner: Ayounsi) [08:24:07] (PS1) Giuseppe Lavagetto: Add api tokens for requestctl web ui [labs/private] - https://gerrit.wikimedia.org/r/1079930 [08:25:37] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Add api tokens for requestctl web ui [labs/private] - https://gerrit.wikimedia.org/r/1079930 (owner: Giuseppe Lavagetto) [08:26:55] (CR) Giuseppe Lavagetto: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [08:26:57] (CR) Volans: mariadb: add data directory accessor (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: Arnaudb) [08:27:46] (PS1) David Caro: toolforge::k8s::deployer: add click libraries [puppet] - https://gerrit.wikimedia.org/r/1079931 [08:29:57] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM."
[puppet] - https://gerrit.wikimedia.org/r/1079931 (owner: David Caro) [08:30:37] (PS6) Giuseppe Lavagetto: role::alerting_host: add web interface for requestctl [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) [08:31:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P69752 and previous config saved to /var/cache/conftool/dbconfig/20241014-083113-arnaudb.json [08:32:08] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4296/co" [puppet] - https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [08:34:35] (PS1) Ilias Sarantopoulos: ml-services: enable multiprocessing for enwiki-damaging [deployment-charts] - https://gerrit.wikimedia.org/r/1079932 (https://phabricator.wikimedia.org/T363336) [08:35:57] (CR) David Caro: [C:+2] toolforge::k8s::deployer: add click libraries [puppet] - https://gerrit.wikimedia.org/r/1079931 (owner: David Caro) [08:36:00] (PS3) Arnaudb: mysql_legacy: double quote escape in run_query [software/spicerack] - https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) [08:36:43] (CR) Arnaudb: "Test written!" [software/spicerack] - https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: Arnaudb) [08:40:02] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host dbproxy1029.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:40:31] (CR) Ayounsi: [C:+1] "good catch!"
[homer/public] - https://gerrit.wikimedia.org/r/1079929 (owner: Cathal Mooney) [08:42:14] (CR) Ayounsi: [C:+2] Prefix validator: ensure k8s role and site [software/netbox-extras] - https://gerrit.wikimedia.org/r/1079525 (https://phabricator.wikimedia.org/T354169) (owner: Ayounsi) [08:42:32] (PS5) Arnaudb: mariadb: add data directory accessor [software/spicerack] - https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) [08:42:52] (CR) Arnaudb: mariadb: add data directory accessor (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: Arnaudb) [08:43:17] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1029.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:43:23] (PS6) Arnaudb: mariadb: add data directory accessor [software/spicerack] - https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) [08:43:28] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster2003.codfw.wmnet [08:43:56] (Merged) jenkins-bot: Prefix validator: ensure k8s role and site [software/netbox-extras] - https://gerrit.wikimedia.org/r/1079525 (https://phabricator.wikimedia.org/T354169) (owner: Ayounsi) [08:44:35] (CR) CI reject: [V:-1] mysql_legacy: double quote escape in run_query [software/spicerack] - https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: Arnaudb) [08:44:41] (PS1) Ayounsi: Netbox: enable prefix validator on -next [puppet] - https://gerrit.wikimedia.org/r/1079933 (https://phabricator.wikimedia.org/T354169) [08:44:42] (PS1) Ayounsi: Netbox: enable prefix validator in prod [puppet] - https://gerrit.wikimedia.org/r/1079934 (https://phabricator.wikimedia.org/T354169) [08:46:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T367781)', diff
saved to https://phabricator.wikimedia.org/P69753 and previous config saved to /var/cache/conftool/dbconfig/20241014-084620-arnaudb.json [08:46:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1195.eqiad.wmnet with reason: Maintenance [08:46:24] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [08:46:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1195.eqiad.wmnet with reason: Maintenance [08:46:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1195 (T367781)', diff saved to https://phabricator.wikimedia.org/P69754 and previous config saved to /var/cache/conftool/dbconfig/20241014-084643-arnaudb.json [08:46:58] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdv) failed on ms-be1065 - https://phabricator.wikimedia.org/T376775#10224838 (10MatthewVernon) p:05Triage→03High [08:47:09] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:47:28] (03PS1) 10JMeybohm: kubernetes: Create profile::kubernetes::container_runtime [puppet] - 10https://gerrit.wikimedia.org/r/1079935 (https://phabricator.wikimedia.org/T362408) [08:48:04] (03CR) 10CI reject: [V:04-1] kubernetes: Create profile::kubernetes::container_runtime [puppet] - 10https://gerrit.wikimedia.org/r/1079935 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [08:48:23] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [08:48:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T367781)', diff saved to https://phabricator.wikimedia.org/P69755 and previous 
config saved to /var/cache/conftool/dbconfig/20241014-084856-arnaudb.json [08:48:58] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [08:49:05] (03PS2) 10JMeybohm: kubernetes: Create profile::kubernetes::container_runtime [puppet] - 10https://gerrit.wikimedia.org/r/1079935 (https://phabricator.wikimedia.org/T362408) [08:49:08] (03CR) 10Ayounsi: [C:03+2] Netbox: enable prefix validator on -next [puppet] - 10https://gerrit.wikimedia.org/r/1079933 (https://phabricator.wikimedia.org/T354169) (owner: 10Ayounsi) [08:49:19] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster2003.codfw.wmnet [08:49:21] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster2004.codfw.wmnet [08:55:24] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster2004.codfw.wmnet [08:55:27] !log jayme@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM kubestagemaster2005.codfw.wmnet [08:55:42] (03CR) 10Volans: mysql_legacy: double quote escape in run_query (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: 10Arnaudb) [08:56:28] (03CR) 10Cathal Mooney: [C:03+2] Add "no_smallnet" term to BGP6_outfilter policy map on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1079929 (owner: 10Cathal Mooney) [08:57:00] (03Merged) 10jenkins-bot: Add "no_smallnet" term to BGP6_outfilter policy map on CRs [homer/public] - 10https://gerrit.wikimedia.org/r/1079929 (owner: 10Cathal Mooney) [08:57:15] (03PS4) 10Arnaudb: mysql_legacy: double quote escape in run_query [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) [08:58:28] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [08:58:37] (03CR) 10Arnaudb: mysql_legacy: double quote 
escape in run_query (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: Arnaudb) [08:59:47] (CR) Volans: [C:+1] "LGTM, thx." [software/spicerack] - https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: Arnaudb) [09:01:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM kubestagemaster2005.codfw.wmnet [09:01:20] (CR) Ayounsi: "It's live on Netbox-next, ready for prod." [puppet] - https://gerrit.wikimedia.org/r/1079934 (https://phabricator.wikimedia.org/T354169) (owner: Ayounsi) [09:01:29] (PS14) Stevemunene: Setup DPE Ceph alerts [alerts] - https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) [09:02:49] (PS3) Clément Goubert: kubernetes: codfw refresh [puppet] - https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) [09:03:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [09:03:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [09:03:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:03:29] (PS3) Clément Goubert: kubernetes: codfw expansion [puppet] - https://gerrit.wikimedia.org/r/1079240 (https://phabricator.wikimedia.org/T376665) [09:03:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:03:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P69756 and previous
config saved to /var/cache/conftool/dbconfig/20241014-090340-ladsgroup.json [09:04:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P69757 and previous config saved to /var/cache/conftool/dbconfig/20241014-090403-arnaudb.json [09:04:40] (03PS2) 10Clément Goubert: kubernetes: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079241 (https://phabricator.wikimedia.org/T376185) [09:05:20] (03CR) 10Btullis: Setup DPE Ceph alerts (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [09:05:51] (03PS2) 10Clément Goubert: kubernetes: eqiad expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) [09:06:35] (03CR) 10Ayounsi: "Thanks ! I think I have a slight preference for the aggregate to keep things more tidy as less prefixes are advertised to WMCS upstreams (" [homer/public] - 10https://gerrit.wikimedia.org/r/1079288 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [09:06:44] (03CR) 10CI reject: [V:04-1] mysql_legacy: double quote escape in run_query [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: 10Arnaudb) [09:07:27] (03CR) 10Btullis: [C:03+1] "Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1079922 (https://phabricator.wikimedia.org/T377104) (owner: 10Brouberol) [09:07:37] (03PS5) 10Arnaudb: mysql_legacy: double quote escape in run_query [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) [09:08:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P69758 and previous config saved to /var/cache/conftool/dbconfig/20241014-090810-arnaudb.json [09:08:14] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [09:08:43] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] role::alerting_host: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078985 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [09:08:56] (03PS2) 10Clément Goubert: kubestage: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079257 (https://phabricator.wikimedia.org/T376171) [09:09:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [09:10:39] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed on ms-be1075 - https://phabricator.wikimedia.org/T377109 (10MatthewVernon) 03NEW [09:10:50] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sdt) failed on ms-be1075 - https://phabricator.wikimedia.org/T377109#10224916 (10MatthewVernon) p:05Triage→03High [09:10:53] (03CR) 10Cathal Mooney: [C:03+1] Netbox: enable prefix validator in prod [puppet] - 10https://gerrit.wikimedia.org/r/1079934 (https://phabricator.wikimedia.org/T354169) (owner: 10Ayounsi) [09:10:55] (03CR) 10David Caro: [V:03+1 C:03+2] "The only reason is that api.toolforge.org is easier to remember for users, being an intentional user entry point, more than an internal se" [puppet] - 10https://gerrit.wikimedia.org/r/1078986 
(https://phabricator.wikimedia.org/T362066) (owner: 10David Caro) [09:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:03] (03PS1) 10Clément Goubert: mc-gp: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079941 (https://phabricator.wikimedia.org/T376968) [09:14:22] (03PS1) 10Michael Große: refactor(tests): don't use per-method coverage annotation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079923 [09:15:21] (03PS2) 10Michael Große: Clear LinkRecommendation suggestions on page save [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079915 (https://phabricator.wikimedia.org/T364341) [09:15:28] (03PS2) 10Michael Große: Run fixLinkRecommendationData even when disabled in CC [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079925 (https://phabricator.wikimedia.org/T373176) [09:16:10] (03CR) 10Kevin Bazira: [C:03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1079932 (https://phabricator.wikimedia.org/T363336) (owner: 10Ilias Sarantopoulos) [09:17:02] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: enable multiprocessing for enwiki-damaging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079932 (https://phabricator.wikimedia.org/T363336) (owner: 10Ilias Sarantopoulos) [09:17:07] (03CR) 10Brouberol: [V:03+1 C:03+2] ceph.backup.s3_local: fix typo in systemd timer command [puppet] - 10https://gerrit.wikimedia.org/r/1079922 (https://phabricator.wikimedia.org/T377104) (owner: 10Brouberol) [09:17:21] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10224935 (10elukey) [09:18:07] (03Merged) 10jenkins-bot: ml-services: enable multiprocessing for enwiki-damaging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079932 (https://phabricator.wikimedia.org/T363336) (owner: 10Ilias Sarantopoulos) [09:18:33] (03PS1) 10Clément Goubert: mc-gp: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079942 (https://phabricator.wikimedia.org/T376186) [09:19:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P69759 and previous config saved to /var/cache/conftool/dbconfig/20241014-091911-arnaudb.json [09:21:04] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:21:06] (03CR) 10Ladsgroup: "I gave some notes." [puppet] - 10https://gerrit.wikimedia.org/r/1078901 (https://phabricator.wikimedia.org/T376726) (owner: 10Kosta Harlan) [09:22:00] (03CR) 10JMeybohm: "This change should not be required rn. 
With the correct host header set, you should be able to access mw-api via localhost:6501, keeping t" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079326 (owner: 10Jforrester) [09:23:05] (03CR) 10JMeybohm: [C:03+1] kubestage: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079257 (https://phabricator.wikimedia.org/T376171) (owner: 10Clément Goubert) [09:23:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P69760 and previous config saved to /var/cache/conftool/dbconfig/20241014-092317-arnaudb.json [09:23:41] (03CR) 10Ayounsi: [C:03+2] Netbox: enable prefix validator in prod [puppet] - 10https://gerrit.wikimedia.org/r/1079934 (https://phabricator.wikimedia.org/T354169) (owner: 10Ayounsi) [09:24:18] (03PS4) 10Clément Goubert: kubernetes: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) [09:24:18] (03PS4) 10Clément Goubert: kubernetes: codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079240 (https://phabricator.wikimedia.org/T376665) [09:24:19] (03PS3) 10Clément Goubert: kubernetes: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079241 (https://phabricator.wikimedia.org/T376185) [09:24:19] (03PS3) 10Clément Goubert: kubernetes: eqiad expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) [09:24:45] (03PS3) 10Clément Goubert: kubestage: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079257 (https://phabricator.wikimedia.org/T376171) [09:26:55] (03PS1) 10Giuseppe Lavagetto: hiddenparma: various bugfixes [puppet] - 10https://gerrit.wikimedia.org/r/1079944 [09:27:28] (03CR) 10JMeybohm: [C:03+1] kubernetes: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079241 (https://phabricator.wikimedia.org/T376185) (owner: 10Clément Goubert) [09:29:05] jouncebot: nowandnext [09:29:05] No deployments scheduled for the next 
0 hour(s) and 30 minute(s)
[09:29:05] In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1000)
[09:30:10] (CR) JMeybohm: [C:-1] kubernetes: eqiad expansion (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) (owner: Clément Goubert)
[09:32:51] (PS2) Giuseppe Lavagetto: hiddenparma: various bugfixes [puppet] - https://gerrit.wikimedia.org/r/1079944
[09:33:01] (PS4) Clément Goubert: kubernetes: eqiad expansion [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307)
[09:33:57] (CR) JMeybohm: [C:-1] kubernetes: codfw refresh (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) (owner: Clément Goubert)
[09:34:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T367781)', diff saved to https://phabricator.wikimedia.org/P69761 and previous config saved to /var/cache/conftool/dbconfig/20241014-093418-arnaudb.json
[09:34:21] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[09:34:21] (PS1) Ladsgroup: Init config for rskwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079947 (https://phabricator.wikimedia.org/T374963)
[09:34:22] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:34:27] (CR) Giuseppe Lavagetto: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4299/co" [puppet] - https://gerrit.wikimedia.org/r/1079944 (owner: Giuseppe Lavagetto)
[09:34:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[09:34:36] !log arnaudb@cumin1002 START -
Cookbook sre.hosts.downtime for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:34:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:35:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T367781)', diff saved to https://phabricator.wikimedia.org/P69762 and previous config saved to /var/cache/conftool/dbconfig/20241014-093459-arnaudb.json
[09:35:02] (CR) CI reject: [V:-1] Init config for rskwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079947 (https://phabricator.wikimedia.org/T374963) (owner: Ladsgroup)
[09:35:08] (CR) CI reject: [V:-1] kubernetes: eqiad expansion [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) (owner: Clément Goubert)
[09:36:30] (CR) Giuseppe Lavagetto: [V:+1 C:+2] hiddenparma: various bugfixes [puppet] - https://gerrit.wikimedia.org/r/1079944 (owner: Giuseppe Lavagetto)
[09:36:46] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:37:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T367781)', diff saved to https://phabricator.wikimedia.org/P69763 and previous config saved to /var/cache/conftool/dbconfig/20241014-093713-arnaudb.json
[09:37:28] (CR) Tiziano Fogli: [C:+1] alertmanager-irc: improve ErrorBudgetBurn SLO alert text (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1078718 (https://phabricator.wikimedia.org/T376740) (owner: Herron)
[09:37:50] (PS2) Ladsgroup: Init config for rskwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079947 (https://phabricator.wikimedia.org/T374963)
[09:38:08] (PS5) Clément Goubert: kubernetes: codfw refresh [puppet] - https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170)
[09:38:24] (CR) JMeybohm: [C:+1] kubernetes: codfw expansion (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1079240 (https://phabricator.wikimedia.org/T376665) (owner: Clément Goubert)
[09:38:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P69764 and previous config saved to /var/cache/conftool/dbconfig/20241014-093824-arnaudb.json
[09:38:36] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1079935 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[09:39:02] (CR) Ladsgroup: [C:+2] Init config for rskwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079947 (https://phabricator.wikimedia.org/T374963) (owner: Ladsgroup)
[09:39:42] (Merged) jenkins-bot: Init config for rskwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079947 (https://phabricator.wikimedia.org/T374963) (owner: Ladsgroup)
[09:39:49] (CR) Clément Goubert: kubernetes: codfw refresh (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) (owner: Clément Goubert)
[09:40:10]
(Abandoned) Tiziano Fogli: curator: free up space to safely restart daemons [puppet] - https://gerrit.wikimedia.org/r/1064781 (https://phabricator.wikimedia.org/T371961) (owner: Tiziano Fogli)
[09:40:20] (CR) Clément Goubert: "recheck" [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) (owner: Clément Goubert)
[09:41:20] !log ladsgroup@deploy2002 Started scap sync-world: Creating rskwiki (T374963)
[09:41:24] T374963: Create Wikipedia Pannonian Rusyn - https://phabricator.wikimedia.org/T374963
[09:42:26] (PS5) Clément Goubert: kubernetes: eqiad expansion [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307)
[09:43:03] (PS1) Giuseppe Lavagetto: hiddenparma: remove notifying the app from the api token files [puppet] - https://gerrit.wikimedia.org/r/1079952
[09:43:57] (CR) Clément Goubert: kubernetes: eqiad expansion (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) (owner: Clément Goubert)
[09:44:29] (CR) Lucas Werkmeister (WMDE): [C:+1] [uawikimedia] Enable the CampaignEvents extension [mediawiki-config] - https://gerrit.wikimedia.org/r/1079518 (https://phabricator.wikimedia.org/T376695) (owner: Daimona Eaytoy)
[09:44:40] (CR) CI reject: [V:-1] kubernetes: eqiad expansion [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) (owner: Clément Goubert)
[09:45:22] (PS6) Clément Goubert: kubernetes: eqiad expansion [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307)
[09:45:32] (CR) Giuseppe Lavagetto: [C:+2] hiddenparma: remove notifying the app from the api token files [puppet] - https://gerrit.wikimedia.org/r/1079952 (owner: Giuseppe Lavagetto)
[09:45:48] (PS6) Clément Goubert: kubernetes: codfw refresh [puppet] - https://gerrit.wikimedia.org/r/1079239
(https://phabricator.wikimedia.org/T376170)
[09:45:48] (PS5) Clément Goubert: kubernetes: codfw expansion [puppet] - https://gerrit.wikimedia.org/r/1079240 (https://phabricator.wikimedia.org/T376665)
[09:45:48] (PS4) Clément Goubert: kubernetes: eqiad refresh [puppet] - https://gerrit.wikimedia.org/r/1079241 (https://phabricator.wikimedia.org/T376185)
[09:45:49] (PS7) Clément Goubert: kubernetes: eqiad expansion [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307)
[09:45:51] (PS4) Clément Goubert: kubestage: codfw refresh [puppet] - https://gerrit.wikimedia.org/r/1079257 (https://phabricator.wikimedia.org/T376171)
[09:47:54] (PS26) Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191)
[09:48:49] (PS27) Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191)
[09:49:10] (PS28) Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191)
[09:50:29] (CR) Phuedx: [C:+1] "Further to @aotto@wikimedia.org and @xcollazo@wikimedia.org's +1's. All changes LGTM!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: Kimberly Sarabia)
[09:51:08] (PS1) Ladsgroup: Add namespace translations for Tai Nüa (tdd) [core] (wmf/1.43.0-wmf.26) - https://gerrit.wikimedia.org/r/1079954 (https://phabricator.wikimedia.org/T375421)
[09:51:23] (CR) Ladsgroup: [C:+2] Add namespace translations for Tai Nüa (tdd) [core] (wmf/1.43.0-wmf.26) - https://gerrit.wikimedia.org/r/1079954 (https://phabricator.wikimedia.org/T375421) (owner: Ladsgroup)
[09:52:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P69765 and previous config saved to /var/cache/conftool/dbconfig/20241014-095220-arnaudb.json
[09:53:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T367781)', diff saved to https://phabricator.wikimedia.org/P69766 and previous config saved to /var/cache/conftool/dbconfig/20241014-095331-arnaudb.json
[09:53:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance
[09:53:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[09:53:46] (PS2) Ladsgroup: mariadb: Add SLAVE MONITOR to promotheus grants [puppet] - https://gerrit.wikimedia.org/r/1079006
[09:53:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2219.codfw.wmnet with reason: Maintenance
[09:53:50] (CR) Ladsgroup: [C:+2] mariadb: Add SLAVE MONITOR to promotheus grants [puppet] - https://gerrit.wikimedia.org/r/1079006 (owner: Ladsgroup)
[09:53:53] (CR) Ladsgroup: [V:+2 C:+2] mariadb: Add SLAVE MONITOR to promotheus grants [puppet] - https://gerrit.wikimedia.org/r/1079006 (owner: Ladsgroup)
[09:53:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2219 (T367781)', diff saved to
https://phabricator.wikimedia.org/P69767 and previous config saved to /var/cache/conftool/dbconfig/20241014-095354-arnaudb.json
[09:54:08] (PS1) JMeybohm: Merge worker_containerd values back into worker [labs/private] - https://gerrit.wikimedia.org/r/1079955 (https://phabricator.wikimedia.org/T362408)
[09:54:08] (CR) Ladsgroup: [V:+2 C:+2] "FWIW, it was on other pc hosts." [puppet] - https://gerrit.wikimedia.org/r/1079006 (owner: Ladsgroup)
[09:55:21] jouncebot: nowandnext
[09:55:21] No deployments scheduled for the next 0 hour(s) and 4 minute(s)
[09:55:21] In 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1000)
[09:55:26] hnowlan: I'm deploying :D
[09:57:46] (PS2) JMeybohm: Merge worker_containerd back to worker role [labs/private] - https://gerrit.wikimedia.org/r/1079955 (https://phabricator.wikimedia.org/T362408)
[09:59:05] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert2002.wikimedia.org with reason: init - oblivian@cumin2002
[09:59:05] !log oblivian@cumin2002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) hiddenparma to alert2002.wikimedia.org with reason: init - oblivian@cumin2002
[09:59:36] Amir1: cool :)
[09:59:59] !log ladsgroup@deploy2002 Finished scap sync-world: Creating rskwiki (T374963) (duration: 18m 38s)
[09:59:59] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert2002.wikimedia.org with reason: init - oblivian@cumin2002
[10:00:00] !log oblivian@cumin2002 END (FAIL) - Cookbook sre.deploy.python-code (exit_code=99) hiddenparma to alert2002.wikimedia.org with reason: init - oblivian@cumin2002
[10:00:00] !log eoghan@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists2001.wikimedia.org
[10:00:04] T374963: Create Wikipedia Pannonian Rusyn - https://phabricator.wikimedia.org/T374963
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1000)
[10:00:35] (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.43.0-wmf.26) - https://gerrit.wikimedia.org/r/1079954 (https://phabricator.wikimedia.org/T375421) (owner: Ladsgroup)
[10:00:45] !log powercycle rdb1014 T376961
[10:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:48] T376961: host rdb1014 is down - https://phabricator.wikimedia.org/T376961
[10:01:39] (PS29) Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191)
[10:02:33] (CR) JMeybohm: [V:+2 C:+2] Merge worker_containerd back to worker role [labs/private] - https://gerrit.wikimedia.org/r/1079955 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:02:53] (CR) Volans: mariadb: clone cookbook maintenance (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) (owner: Arnaudb)
[10:03:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P69768 and previous config saved to /var/cache/conftool/dbconfig/20241014-100356-ladsgroup.json
[10:05:12] (PS1) Ladsgroup: Init config for nrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079958 (https://phabricator.wikimedia.org/T375087)
[10:05:30] (CR) JMeybohm: [V:+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4300/co" [puppet] - https://gerrit.wikimedia.org/r/1079935 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:06:19] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists2001.wikimedia.org
[10:06:42] !log eoghan@cumin2002 START - Cookbook
sre.hosts.reboot-single for host lists1004.wikimedia.org
[10:07:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P69769 and previous config saved to /var/cache/conftool/dbconfig/20241014-100727-arnaudb.json
[10:07:30] (PS30) Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191)
[10:08:20] (CR) Alexandros Kosiaris: [C:+1] kubernetes: Create profile::kubernetes::container_runtime [puppet] - https://gerrit.wikimedia.org/r/1079935 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:09:07] (CR) JMeybohm: [V:+1 C:+2] kubernetes: Create profile::kubernetes::container_runtime [puppet] - https://gerrit.wikimedia.org/r/1079935 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:11:57] (PS1) JMeybohm: cumin/aliases: Merge worker_containerd back to worker role [puppet] - https://gerrit.wikimedia.org/r/1079960 (https://phabricator.wikimedia.org/T362408)
[10:12:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling for reclone (T375652)', diff saved to https://phabricator.wikimedia.org/P69770 and previous config saved to /var/cache/conftool/dbconfig/20241014-101246-ladsgroup.json
[10:12:50] T375652: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '1' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\CheckUserLogService::addLogEntry - https://phabricator.wikimedia.org/T375652
[10:13:31] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists1004.wikimedia.org
[10:13:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling for reclone (T375652)', diff saved to https://phabricator.wikimedia.org/P69771 and previous config saved to /var/cache/conftool/dbconfig/20241014-101354-ladsgroup.json
[10:14:21] (CR) JMeybohm: [C:+2] cumin/aliases: Merge worker_containerd back to
worker role [puppet] - https://gerrit.wikimedia.org/r/1079960 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:15:57] (PS1) JMeybohm: Remove role kubernetes::staging::worker_containerd [labs/private] - https://gerrit.wikimedia.org/r/1079961 (https://phabricator.wikimedia.org/T362408)
[10:17:14] (CR) JMeybohm: [V:+2 C:+2] Remove role kubernetes::staging::worker_containerd [labs/private] - https://gerrit.wikimedia.org/r/1079961 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:17:34] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db2194.codfw.wmnet onto db2227.codfw.wmnet
[10:17:46] SRE, serviceops: host rdb1014 is down - https://phabricator.wikimedia.org/T376961#10225247 (akosiaris) Open→Resolved a:akosiaris The host has some history of failure per {T370633} It is the passive failover for rdb1013, which means we have no degradation of anything right now. Nothing...
[10:19:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P69772 and previous config saved to /var/cache/conftool/dbconfig/20241014-101903-ladsgroup.json
[10:19:24] (CR) Volans: [C:+1] "It now looks in a good state for starting testing it on safe instances with the test-cookbook." [cookbooks] - https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) (owner: Arnaudb)
[10:20:31] FIRING: RedisReplicaDown: Redis replica down rdb1014:16379 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-site=eqiad&var-job=redis_misc&var-instance=rdb1014:16379 - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown
[10:21:19] (CR) Clément Goubert: "Thanks for that, hope the amended docstrings are clearer."
[cookbooks] - https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: Clément Goubert)
[10:22:31] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - https://gerrit.wikimedia.org/r/1079619 (https://phabricator.wikimedia.org/T362620) (owner: NMW03)
[10:22:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T367781)', diff saved to https://phabricator.wikimedia.org/P69773 and previous config saved to /var/cache/conftool/dbconfig/20241014-102234-arnaudb.json
[10:22:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[10:22:38] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[10:22:48] (PS6) Clément Goubert: sre.discovery.datacenter: Add failover_from action [cookbooks] - https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364)
[10:22:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[10:22:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T367781)', diff saved to https://phabricator.wikimedia.org/P69774 and previous config saved to /var/cache/conftool/dbconfig/20241014-102256-arnaudb.json
[10:24:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T367781)', diff saved to https://phabricator.wikimedia.org/P69775 and previous config saved to /var/cache/conftool/dbconfig/20241014-102412-arnaudb.json
[10:24:22] (Merged) jenkins-bot: Add namespace translations for Tai Nüa (tdd) [core] (wmf/1.43.0-wmf.26) - https://gerrit.wikimedia.org/r/1079954 (https://phabricator.wikimedia.org/T375421) (owner: Ladsgroup)
[10:25:01] !log ladsgroup@deploy2002 Started scap
sync-world: Backport for [[gerrit:1079954|Add namespace translations for Tai Nüa (tdd) (T375421)]]
[10:25:04] T375421: Prepare MessagesTdd.php for Tai Nüa Wikipedia - https://phabricator.wikimedia.org/T375421
[10:25:27] jouncebot: now
[10:25:27] For the next 0 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1000)
[10:25:29] (Abandoned) Cathal Mooney: Add orlonger to policy on announced v6 routes from cloudsw [homer/public] - https://gerrit.wikimedia.org/r/1079288 (https://phabricator.wikimedia.org/T245495) (owner: Cathal Mooney)
[10:25:33] (CR) Cathal Mooney: "ok np." [homer/public] - https://gerrit.wikimedia.org/r/1079288 (https://phabricator.wikimedia.org/T245495) (owner: Cathal Mooney)
[10:27:05] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1079954|Add namespace translations for Tai Nüa (tdd) (T375421)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:27:17] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[10:27:22] (CR) Ladsgroup: [C:+2] Init config for nrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079958 (https://phabricator.wikimedia.org/T375087) (owner: Ladsgroup)
[10:28:04] (Merged) jenkins-bot: Init config for nrwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079958 (https://phabricator.wikimedia.org/T375087) (owner: Ladsgroup)
[10:28:22] jouncebot: next
[10:28:22] In 2 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1300)
[10:29:39] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10225275 (phaultfinder)
[10:31:47] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079954|Add namespace translations for Tai Nüa (tdd) (T375421)]] (duration: 06m 45s)
[10:31:50] T375421: Prepare MessagesTdd.php for Tai Nüa
Wikipedia - https://phabricator.wikimedia.org/T375421
[10:32:10] (PS1) Ladsgroup: Init config for tddwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079969 (https://phabricator.wikimedia.org/T375422)
[10:33:11] !log ladsgroup@deploy2002 Started scap sync-world: Creating nrwiki (T375087)
[10:33:15] T375087: Create Wikipedia South Ndebele - https://phabricator.wikimedia.org/T375087
[10:34:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P69776 and previous config saved to /var/cache/conftool/dbconfig/20241014-103410-ladsgroup.json
[10:34:21] (PS1) JMeybohm: wikikube: Prepare clusters for containerd workers [puppet] - https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408)
[10:35:45] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert2002.wikimedia.org with reason: init - oblivian@cumin2002
[10:35:46] (PS2) JMeybohm: wikikube: Prepare clusters for containerd workers [puppet] - https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408)
[10:35:59] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert2002.wikimedia.org with reason: init - oblivian@cumin2002
[10:36:29] (CR) JMeybohm: [C:+1] kubernetes: eqiad expansion [puppet] - https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) (owner: Clément Goubert)
[10:36:36] (CR) JMeybohm: [C:+1] kubernetes: codfw refresh [puppet] - https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) (owner: Clément Goubert)
[10:36:52] (CR) JMeybohm: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) (owner: JMeybohm)
[10:37:44] (CR) Ladsgroup: [C:+2] Init config for tddwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079969
(https://phabricator.wikimedia.org/T375422) (owner: Ladsgroup)
[10:38:30] (Merged) jenkins-bot: Init config for tddwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079969 (https://phabricator.wikimedia.org/T375422) (owner: Ladsgroup)
[10:38:52] (PS17) Jelto: miscweb: add support to mount add confimaps [deployment-charts] - https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793)
[10:39:06] (PS19) Jelto: wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793)
[10:39:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P69777 and previous config saved to /var/cache/conftool/dbconfig/20241014-103919-arnaudb.json
[10:40:06] !log ladsgroup@deploy2002 Finished scap sync-world: Creating nrwiki (T375087) (duration: 06m 54s)
[10:40:10] T375087: Create Wikipedia South Ndebele - https://phabricator.wikimedia.org/T375087
[10:42:09] !log ladsgroup@deploy2002 Started scap sync-world: Creating tddwiki (T375422)
[10:42:12] T375422: Create Wikipedia Tai Nüa - https://phabricator.wikimedia.org/T375422
[10:43:17] (PS1) Btullis: Remove the dumps_store_load_average icinga check [puppet] - https://gerrit.wikimedia.org/r/1079971 (https://phabricator.wikimedia.org/T374821)
[10:44:02] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert1002.wikimedia.org with reason: init - oblivian@cumin2002
[10:44:21] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert1002.wikimedia.org with reason: init - oblivian@cumin2002
[10:44:37] (CR) Alexandros Kosiaris: [C:+2] mw-script: Add prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) (owner: Alexandros Kosiaris)
[10:44:46] (PS1)
Ladsgroup: Init config for annwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079972 (https://phabricator.wikimedia.org/T376332)
[10:45:40] (Merged) jenkins-bot: mw-script: Add prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) (owner: Alexandros Kosiaris)
[10:46:34] (CR) Ladsgroup: [C:+2] Init config for annwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079972 (https://phabricator.wikimedia.org/T376332) (owner: Ladsgroup)
[10:47:16] (PS1) Jcrespo: check footer legal complience: Add support for relative URLs [puppet] - https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789)
[10:47:24] (Merged) jenkins-bot: Init config for annwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079972 (https://phabricator.wikimedia.org/T376332) (owner: Ladsgroup)
[10:47:43] (Restored) Btullis: Set a non-default mapreduce file committer algorithm for spark [puppet] - https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: Btullis)
[10:48:02] elukey: volans sending patch to fix the "Ensure legal html en.wp" alert
[10:48:04] (CR) CI reject: [V:-1] check footer legal complience: Add support for relative URLs [puppet] - https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: Jcrespo)
[10:48:37] jynus: ack thanks!
[10:48:53] (CR) Btullis: [V:+1] "Reopening this patch, based on the comment here:" [puppet] - https://gerrit.wikimedia.org/r/975006 (https://phabricator.wikimedia.org/T351388) (owner: Btullis)
[10:48:55] !log ladsgroup@deploy2002 Finished scap sync-world: Creating tddwiki (T375422) (duration: 06m 46s)
[10:49:00] T375422: Create Wikipedia Tai Nüa - https://phabricator.wikimedia.org/T375422
[10:49:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T376905)', diff saved to https://phabricator.wikimedia.org/P69778 and previous config saved to /var/cache/conftool/dbconfig/20241014-104916-ladsgroup.json
[10:49:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[10:49:29] (CR) Alexandros Kosiaris: [C:+2] mw-script: Remove ci_only_release_do_not_deploy [deployment-charts] - https://gerrit.wikimedia.org/r/1078672 (owner: Alexandros Kosiaris)
[10:49:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[10:49:36] (CR) CI reject: [V:-1] mw-script: Remove ci_only_release_do_not_deploy [deployment-charts] - https://gerrit.wikimedia.org/r/1078672 (owner: Alexandros Kosiaris)
[10:49:37] (PS4) Alexandros Kosiaris: mw-script: Remove ci_only_release_do_not_deploy [deployment-charts] - https://gerrit.wikimedia.org/r/1078672
[10:49:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T376905)', diff saved to https://phabricator.wikimedia.org/P69779 and previous config saved to /var/cache/conftool/dbconfig/20241014-104941-ladsgroup.json
[10:51:31] !log mvernon@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[10:52:08] !log ladsgroup@deploy2002 Started scap sync-world: Creating annwiki (T376332)
[10:52:11] T376332: Create Wikipedia Obolo -
https://phabricator.wikimedia.org/T376332
[10:52:51] (PS2) Jcrespo: check footer legal complience: Add support for relative URLs [puppet] - https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789)
[10:54:10] (PS3) Jcrespo: check footer legal complience: Add support for relative URLs [puppet] - https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789)
[10:54:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P69780 and previous config saved to /var/cache/conftool/dbconfig/20241014-105421-arnaudb.json
[10:54:25] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781
[10:54:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P69781 and previous config saved to /var/cache/conftool/dbconfig/20241014-105426-arnaudb.json
[10:55:20] (PS15) Stevemunene: Setup DPE Ceph alerts [alerts] - https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583)
[10:55:27] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe
[10:55:40] (CR) Alexandros Kosiaris: mw-script: Remove ci_only_release_do_not_deploy [deployment-charts] - https://gerrit.wikimedia.org/r/1078672 (owner: Alexandros Kosiaris)
[10:55:58] (CR) Alexandros Kosiaris: [C:+2] "recheck" [deployment-charts] - https://gerrit.wikimedia.org/r/1078672 (owner: Alexandros Kosiaris)
[10:56:48] (Merged) jenkins-bot: mw-script: Remove ci_only_release_do_not_deploy [deployment-charts] - https://gerrit.wikimedia.org/r/1078672 (owner: Alexandros Kosiaris)
[10:57:11] (PS1) Ladsgroup: Init config for ibawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079974
(https://phabricator.wikimedia.org/T376568)
[10:57:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T376905)', diff saved to https://phabricator.wikimedia.org/P69782 and previous config saved to /var/cache/conftool/dbconfig/20241014-105755-ladsgroup.json
[10:58:53] !log ladsgroup@deploy2002 Finished scap sync-world: Creating annwiki (T376332) (duration: 06m 45s)
[10:58:57] T376332: Create Wikipedia Obolo - https://phabricator.wikimedia.org/T376332
[10:59:06] (CR) Ladsgroup: [C:+2] Init config for ibawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079974 (https://phabricator.wikimedia.org/T376568) (owner: Ladsgroup)
[11:00:04] (Merged) jenkins-bot: Init config for ibawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1079974 (https://phabricator.wikimedia.org/T376568) (owner: Ladsgroup)
[11:00:26] SRE, SRE-swift-storage: The file "XXX" is in an inconsistent state within the internal storage backends - https://phabricator.wikimedia.org/T291137#10225382 (MatthewVernon) @Yann Please open new tickets if you have a new object you want looking at, otherwise these phab tickets just become a series of loo...
[11:00:32] (CR) Btullis: [C:+1] "Looks good to me." [alerts] - https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) (owner: Stevemunene)
[11:00:42] !log eoghan@cumin2002 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet
[11:01:04] !log ladsgroup@deploy2002 Started scap sync-world: Creating ibawiki (T376568)
[11:01:09] T376568: Create Wikipedia Iban - https://phabricator.wikimedia.org/T376568
[11:01:27] (CR) Jcrespo: [C:-1] "Barely tested, needs checking against https://people.wikimedia.org/~jynus/ examples."
[puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: 10Jcrespo) [11:03:34] (03PS1) 10Ladsgroup: Init config for bclwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079976 (https://phabricator.wikimedia.org/T377084) [11:05:08] (03CR) 10Hnowlan: php-cli: include mercurius in 8.1 image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1077682 (https://phabricator.wikimedia.org/T371699) (owner: 10Hnowlan) [11:05:38] !log eoghan@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1003.eqiad.wmnet [11:06:11] thanks jynus [11:07:32] (03CR) 10Ladsgroup: [C:03+2] Init config for bclwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079976 (https://phabricator.wikimedia.org/T377084) (owner: 10Ladsgroup) [11:07:50] !log ladsgroup@deploy2002 Finished scap sync-world: Creating ibawiki (T376568) (duration: 06m 45s) [11:07:55] T376568: Create Wikipedia Iban - https://phabricator.wikimedia.org/T376568 [11:08:24] (03Merged) 10jenkins-bot: Init config for bclwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079976 (https://phabricator.wikimedia.org/T377084) (owner: 10Ladsgroup) [11:09:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P69783 and previous config saved to /var/cache/conftool/dbconfig/20241014-110927-arnaudb.json [11:09:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T367781)', diff saved to https://phabricator.wikimedia.org/P69784 and previous config saved to /var/cache/conftool/dbconfig/20241014-110933-arnaudb.json [11:09:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1207.eqiad.wmnet with reason: Maintenance [11:09:37] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [11:09:49] !log 
arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1207.eqiad.wmnet with reason: Maintenance [11:09:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T367781)', diff saved to https://phabricator.wikimedia.org/P69785 and previous config saved to /var/cache/conftool/dbconfig/20241014-110956-arnaudb.json [11:09:57] !log ladsgroup@deploy2002 Started scap sync-world: Creating bclwikisource (T377084) [11:10:03] T377084: Create Wikisource Central Bikol - https://phabricator.wikimedia.org/T377084 [11:12:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T367781)', diff saved to https://phabricator.wikimedia.org/P69786 and previous config saved to /var/cache/conftool/dbconfig/20241014-111211-arnaudb.json [11:13:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P69787 and previous config saved to /var/cache/conftool/dbconfig/20241014-111302-ladsgroup.json [11:13:58] (03PS3) 10JMeybohm: wikikube: Prepare clusters for containerd workers [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) [11:14:12] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [11:16:47] !log ladsgroup@deploy2002 Finished scap sync-world: Creating bclwikisource (T377084) (duration: 06m 49s) [11:16:50] T377084: Create Wikisource Central Bikol - https://phabricator.wikimedia.org/T377084 [11:20:44] (03PS1) 10Hashar: Merge tag 'v3.10.2' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079981 (https://phabricator.wikimedia.org/T373897) [11:21:42] (03PS1) 10Cathal Mooney: Remove WMCS codfw prefix from CR aggregate conf and adjust outfilter [homer/public] - 10https://gerrit.wikimedia.org/r/1079982 (https://phabricator.wikimedia.org/T245495) 
[11:24:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P69788 and previous config saved to /var/cache/conftool/dbconfig/20241014-112434-arnaudb.json [11:26:31] !log Running ./redis-check-aof --fix on rdb1014 tcp_6379 instance - T376961 [11:26:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:34] T376961: host rdb1014 is down - https://phabricator.wikimedia.org/T376961 [11:27:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P69789 and previous config saved to /var/cache/conftool/dbconfig/20241014-112719-arnaudb.json [11:28:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P69790 and previous config saved to /var/cache/conftool/dbconfig/20241014-112809-ladsgroup.json [11:29:54] (03PS4) 10JMeybohm: wikikube: Prepare clusters for containerd workers [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) [11:30:14] Amir1: mind if I do a quick restbase deploy in between your syncs? 
[11:30:23] hnowlan: I am done [11:30:26] sorry I forgot to mention [11:30:31] RESOLVED: RedisReplicaDown: Redis replica down rdb1014:16379 redis_misc - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-site=eqiad&var-job=redis_misc&var-instance=rdb1014:16379 - https://alerts.wikimedia.org/?q=alertname%3DRedisReplicaDown [11:30:48] (03PS5) 10JMeybohm: wikikube: Prepare clusters for containerd workers [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) [11:30:55] !log andrewtavis-wmde@deploy2002 Started deploy [airflow-dags/wmde@c9a2532]: (no justification provided) [11:30:55] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [11:31:01] !log andrewtavis-wmde@deploy2002 Finished deploy [airflow-dags/wmde@c9a2532]: (no justification provided) (duration: 00m 08s) [11:31:30] (03CR) 10Stevemunene: [C:03+2] Setup DPE Ceph alerts [alerts] - 10https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [11:31:55] np, thanks! [11:32:42] (03Merged) 10jenkins-bot: Setup DPE Ceph alerts [alerts] - 10https://gerrit.wikimedia.org/r/1076460 (https://phabricator.wikimedia.org/T369583) (owner: 10Stevemunene) [11:33:55] (03PS1) 10Ladsgroup: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079984 [11:34:24] !log hnowlan@deploy2002 Started deploy [restbase/deploy@26112d4]: Remove unused AQS components. 
Add bdrwiki (T371761) [11:34:28] T371761: Add bdrwiki to RESTBase - https://phabricator.wikimedia.org/T371761 [11:39:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P69791 and previous config saved to /var/cache/conftool/dbconfig/20241014-113941-arnaudb.json [11:39:45] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [11:42:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P69792 and previous config saved to /var/cache/conftool/dbconfig/20241014-114225-arnaudb.json [11:42:40] (03CR) 10Hashar: [C:03+2] Merge tag 'v3.10.2' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079981 (https://phabricator.wikimedia.org/T373897) (owner: 10Hashar) [11:43:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T376905)', diff saved to https://phabricator.wikimedia.org/P69793 and previous config saved to /var/cache/conftool/dbconfig/20241014-114316-ladsgroup.json [11:43:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [11:43:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [11:43:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T376905)', diff saved to https://phabricator.wikimedia.org/P69794 and previous config saved to /var/cache/conftool/dbconfig/20241014-114341-ladsgroup.json [11:44:00] (03CR) 10Elukey: [C:03+2] Add aux-k8s-etcd1004 in service [puppet] - 10https://gerrit.wikimedia.org/r/1079534 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [11:45:51] !log Restarting MediaModeration scanning script - 
https://wikitech.wikimedia.org/wiki/MediaModeration [11:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:02] !log hnowlan@deploy2002 Finished deploy [restbase/deploy@26112d4]: Remove unused AQS components. Add bdrwiki (T371761) (duration: 15m 38s) [11:50:07] T371761: Add bdrwiki to RESTBase - https://phabricator.wikimedia.org/T371761 [11:50:54] !log btullis@cumin1002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [11:51:35] (03CR) 10Elukey: [C:03+2] Add aux-k8s-etcd1004 to the aux-k8s SRV records [dns] - 10https://gerrit.wikimedia.org/r/1079539 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [11:51:37] !incidents [11:51:37] 5318 (UNACKED) db2149 (paged)/MariaDB Replica SQL: s3 (paged) [11:51:37] 5316 (RESOLVED) DDoSDetected sre (netflow5002:9100 eqsin) [11:51:38] 5317 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [11:51:38] 5315 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [11:51:38] 5314 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin) [11:51:38] 5313 (RESOLVED) db2147 (paged)/MariaDB Replica Lag: s4 (paged) [11:51:48] !ack 5318 [11:51:48] 5318 (ACKED) db2149 (paged)/MariaDB Replica SQL: s3 (paged) [11:52:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2194.codfw.wmnet onto db2227.codfw.wmnet [11:52:24] FIRING: SystemdUnitFailed: etcd.service on aux-k8s-etcd1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:52:25] Amir1, arnaudb related to any ongoing work? [11:52:36] the page for db2149 [11:52:37] !log btullis@cumin1002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [11:52:58] looks like it was being cloned? 
[11:53:39] also no page on IRC [11:53:44] that's weird [11:53:52] still haven't got a page from VO [11:54:11] :/ [11:54:14] (03CR) 10Btullis: [C:03+1] "Also looks good to me. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) (owner: 10Hnowlan) [11:54:26] (03Merged) 10jenkins-bot: Merge tag 'v3.10.2' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1079981 (https://phabricator.wikimedia.org/T373897) (owner: 10Hashar) [11:54:37] the database server alerts seem to come out from nagios to alerts@w.o directly. [11:55:46] Emperor: so? they should page here anyway [11:56:27] !log aborrero@cumin1002 START - Cookbook sre.dns.netbox [11:56:51] !log Started time limited scan on enwiki for MediaModeration - https://wikitech.wikimedia.org/wiki/MediaModeration [11:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:15] so it seems an s3 replica [11:57:24] RESOLVED: SystemdUnitFailed: etcd.service on aux-k8s-etcd1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:57:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T367781)', diff saved to https://phabricator.wikimedia.org/P69796 and previous config saved to /var/cache/conftool/dbconfig/20241014-115732-arnaudb.json [11:57:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1218.eqiad.wmnet with reason: Maintenance [11:57:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [11:57:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1218.eqiad.wmnet with reason: Maintenance [11:57:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T367781)', diff saved to
https://phabricator.wikimedia.org/P69797 and previous config saved to /var/cache/conftool/dbconfig/20241014-115755-arnaudb.json [11:57:59] Slave: Index for table 'recentchanges' is corrupt; try to repair it Error_code: 1034 [11:58:08] there seems to be an issue in replication afaics [11:58:53] I've never depooled a mariadb instance but I'd say it is the case [11:58:56] checking docs [11:59:48] volans: it's not the clone, the same mariadb bug [11:59:51] !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb2004-dev cloud-private adddress - aborrero@cumin1002" [11:59:55] !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudlb2004-dev cloud-private adddress - aborrero@cumin1002" [11:59:55] !log aborrero@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:00:01] Amir1: o/ should I dbctl depool it?
[12:00:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T367781)', diff saved to https://phabricator.wikimedia.org/P69798 and previous config saved to /var/cache/conftool/dbconfig/20241014-120011-arnaudb.json [12:00:14] elukey: no :( I have depooled two already [12:00:29] we should fix it right away, let me do it [12:00:29] ack :( [12:00:32] <3 [12:00:36] lemme know how/if I can help [12:01:16] (03CR) 10Elukey: [C:03+2] Add aux-k8s-etcd1005 to the Aux k8s SRV records [dns] - 10https://gerrit.wikimedia.org/r/1079540 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [12:01:30] elukey: it should be fixed now [12:01:44] (03CR) 10Elukey: [C:03+2] Add aux-k8s-etcd1005 in service [puppet] - 10https://gerrit.wikimedia.org/r/1079535 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [12:01:51] (03PS2) 10Elukey: Add aux-k8s-etcd1005 in service [puppet] - 10https://gerrit.wikimedia.org/r/1079535 (https://phabricator.wikimedia.org/T344230) [12:01:53] (03CR) 10Elukey: [V:03+2 C:03+2] Add aux-k8s-etcd1005 in service [puppet] - 10https://gerrit.wikimedia.org/r/1079535 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [12:02:21] (03PS1) 10Arturo Borrero Gonzalez: cloudlb2004-dev: replace cloudlb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1079987 (https://phabricator.wikimedia.org/T377126) [12:02:32] !incodents [12:02:37] !incidents [12:02:38] 5318 (RESOLVED) db2149 (paged)/MariaDB Replica SQL: s3 (paged) [12:02:38] 5319 (RESOLVED) db2149 (paged)/MariaDB Replica Lag: s3 (paged) [12:02:38] 5316 (RESOLVED) DDoSDetected sre (netflow5002:9100 eqsin) [12:02:38] 5317 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:02:38] 5315 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [12:02:39] 5314 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin) [12:02:39] 5313 (RESOLVED) db2147 (paged)/MariaDB Replica Lag: 
s4 (paged) [12:02:41] yep [12:02:46] Amir1: <3 [12:02:58] thanks Amir1! [12:04:22] (03CR) 10CI reject: [V:04-1] cloudlb2004-dev: replace cloudlb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1079987 (https://phabricator.wikimedia.org/T377126) (owner: 10Arturo Borrero Gonzalez) [12:06:55] (03PS2) 10Arturo Borrero Gonzalez: cloudlb2004-dev: replace cloudlb2001-dev [puppet] - 10https://gerrit.wikimedia.org/r/1079987 (https://phabricator.wikimedia.org/T377126) [12:08:54] FIRING: [2x] SystemdUnitFailed: etcd.service on aux-k8s-etcd1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:48] !log increase etcd k8s aux cluster from 3 -> 5 - T344230 [12:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:51] T344230: Get aux-k8s cluster row-redundant and with more workers - https://phabricator.wikimedia.org/T344230 [12:12:05] (03CR) 10Ayounsi: [C:03+1] "lgtm!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1079982 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [12:13:54] RESOLVED: [2x] SystemdUnitFailed: etcd.service on aux-k8s-etcd1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:15:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P69799 and previous config saved to /var/cache/conftool/dbconfig/20241014-121518-arnaudb.json [12:19:44] (03CR) 10Hnowlan: [V:03+1 C:03+2] aqs: remove AQSv1 service components [puppet] - 10https://gerrit.wikimedia.org/r/1075163 (https://phabricator.wikimedia.org/T350143) (owner: 10Hnowlan) [12:21:00] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079997 [12:22:01] (03PS1) 10Elukey: Remove aux-k8x-{ctrl,worker}1001 from production [puppet] - 10https://gerrit.wikimedia.org/r/1079999 (https://phabricator.wikimedia.org/T344230) [12:22:24] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=aux-k8s-ctrl1001.eqiad.wmnet [12:22:30] !log elukey@puppetserver1001 conftool action : set/pooled=inactive; selector: name=aux-k8s-worker1001.eqiad.wmnet [12:23:56] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts aux-k8s-ctrl1001.eqiad.wmnet [12:24:45] (03CR) 10Cathal Mooney: [C:03+2] Remove WMCS codfw prefix from CR aggregate conf and adjust outfilter [homer/public] - 10https://gerrit.wikimedia.org/r/1079982 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [12:26:01] (03Merged) 10jenkins-bot: Remove WMCS codfw prefix from CR aggregate conf and adjust outfilter [homer/public] - 10https://gerrit.wikimedia.org/r/1079982 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [12:28:49] !log 
elukey@cumin1002 START - Cookbook sre.dns.netbox [12:30:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P69800 and previous config saved to /var/cache/conftool/dbconfig/20241014-123025-arnaudb.json [12:30:40] FIRING: KubernetesRsyslogDown: rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:30:57] !log removed all aqsv1 service components from aqs* hosts [12:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:19] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aux-k8s-ctrl1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [12:32:23] jouncebot: next [12:32:23] In 0 hour(s) and 27 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1300) [12:32:38] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aux-k8s-ctrl1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [12:32:38] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:32:38] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aux-k8s-ctrl1001.eqiad.wmnet [12:32:52] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts aux-k8s-worker1001.eqiad.wmnet [12:35:18] (03PS1) 10Hashar: Gerrit 3.10.2 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1080009 (https://phabricator.wikimedia.org/T373897) [12:37:44] 
!log elukey@cumin1002 START - Cookbook sre.dns.netbox [12:38:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69801 and previous config saved to /var/cache/conftool/dbconfig/20241014-123853-ladsgroup.json [12:40:59] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aux-k8s-worker1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [12:43:14] (03CR) 10Elukey: "Added two suggestions to use urllib parse directly, I'll stand by until you are ready for review!" [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: 10Jcrespo) [12:43:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aux-k8s-worker1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [12:43:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:43:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aux-k8s-worker1001.eqiad.wmnet [12:43:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T376905)', diff saved to https://phabricator.wikimedia.org/P69802 and previous config saved to /var/cache/conftool/dbconfig/20241014-124357-ladsgroup.json [12:44:25] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@4b69f50]: Stage Refine fixes on test cluster [airflow-dags@4b69f503] [12:44:38] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@4b69f50]: Stage Refine fixes on test cluster [airflow-dags@4b69f503] (duration: 00m 12s) [12:45:26] (03PS2) 10Hashar: Gerrit 3.10.2 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1080009 
(https://phabricator.wikimedia.org/T373897) [12:45:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T367781)', diff saved to https://phabricator.wikimedia.org/P69803 and previous config saved to /var/cache/conftool/dbconfig/20241014-124532-arnaudb.json [12:45:34] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance [12:45:36] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [12:45:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance [12:45:52] (03CR) 10Hashar: [C:03+2] Gerrit 3.10.2 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1080009 (https://phabricator.wikimedia.org/T373897) (owner: 10Hashar) [12:45:54] (03PS1) 10Elukey: kubernetes: change the AUX etcd urls nodes [puppet] - 10https://gerrit.wikimedia.org/r/1080011 (https://phabricator.wikimedia.org/T344230) [12:45:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1219 (T367781)', diff saved to https://phabricator.wikimedia.org/P69804 and previous config saved to /var/cache/conftool/dbconfig/20241014-124554-arnaudb.json [12:46:25] (03Merged) 10jenkins-bot: Gerrit 3.10.2 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1080009 (https://phabricator.wikimedia.org/T373897) (owner: 10Hashar) [12:47:53] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 10Spicerack: mariadb: systemctl status accessor in mysql_legacy - https://phabricator.wikimedia.org/T377129 (10ABran-WMF) 03NEW [12:47:56] (03CR) 10Ayounsi: [C:03+1] Remove aux-k8x-{ctrl,worker}1001 from production [puppet] - 10https://gerrit.wikimedia.org/r/1079999 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [12:48:00] (03PS6) 10JMeybohm: wikikube: Prepare clusters for containerd workers [puppet] - 
10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) [12:48:08] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079970 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [12:48:51] (03CR) 10Elukey: [C:03+2] Remove aux-k8x-{ctrl,worker}1001 from production [puppet] - 10https://gerrit.wikimedia.org/r/1079999 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [12:48:58] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, 10Spicerack: mariadb: systemctl status accessor in mysql_legacy - https://phabricator.wikimedia.org/T377129#10225683 (10ABran-WMF) p:05Triage→03Medium [12:49:00] 10SRE-tools, 06Data-Persistence-SRE, 06DBA, 06Infrastructure-Foundations, 10Spicerack: mariadb: systemctl status accessor in mysql_legacy - https://phabricator.wikimedia.org/T377129#10225686 (10ABran-WMF) [12:49:15] (03CR) 10Elukey: [C:03+2] kubernetes: change the AUX etcd urls nodes [puppet] - 10https://gerrit.wikimedia.org/r/1080011 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [12:49:30] (03PS1) 10Ayounsi: re-image: ask user about migrating to per-rack vlan/IP [cookbooks] - 10https://gerrit.wikimedia.org/r/1080012 [12:50:14] (03PS2) 10Ayounsi: re-image: ask user about migrating to per-rack vlan/IP [cookbooks] - 10https://gerrit.wikimedia.org/r/1080012 [12:53:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P69805 and previous config saved to /var/cache/conftool/dbconfig/20241014-125358-ladsgroup.json [12:56:28] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10225742 (10aborrero) >>! In T375847#10217807, @cmooney wrote: >>>! In T375847#10195673, @aborrero wrote: >> `lang=shell-session >> roo... 
[12:57:21] (03PS1) 10Elukey: Remove aux-k8s-etcd100[1,2] from the AUX client SRV records [dns] - 10https://gerrit.wikimedia.org/r/1080016 (https://phabricator.wikimedia.org/T344230) [12:57:22] (03PS1) 10Elukey: Remove aux-k8s-etcd1001 from the AUX cluster's SRV records [dns] - 10https://gerrit.wikimedia.org/r/1080017 (https://phabricator.wikimedia.org/T344230) [12:57:24] (03PS1) 10Elukey: Remove aux-k8s-etcd1002 from the AUX cluster's SRV records [dns] - 10https://gerrit.wikimedia.org/r/1080018 (https://phabricator.wikimedia.org/T344230) [12:57:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079923 (owner: 10Michael Große) [12:57:54] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079894 (owner: 10Michael Große) [12:58:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079915 (https://phabricator.wikimedia.org/T364341) (owner: 10Michael Große) [12:58:24] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079925 (https://phabricator.wikimedia.org/T373176) (owner: 10Michael Große) [12:59:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P69806 and 
previous config saved to /var/cache/conftool/dbconfig/20241014-125904-ladsgroup.json [12:59:23] (03CR) 10Lucas Werkmeister (WMDE): "I’m confused, I don’t see how this is different from the change Ie3906e3b67 that had to be reverted… you say you forgot to update `wgMetaN" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079619 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [13:00:04] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1300). [13:00:05] Daimona, Nemoralis, and MichaelG_WMF: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:13] o/ [13:00:16] o/ [13:00:25] jouncebot: now [13:00:25] For the next 0 hour(s) and 59 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1300) [13:00:25] o/ [13:00:27] o/ [13:00:35] ooh, a lot of patches suddenly appeared [13:00:46] My changes can be deployed all together and are not testable [13:00:56] lots of PHP notices in logspam-watch, meh [13:01:25] maybe the change to the maint script is in principle, because it changes the output, but no idea how that works with deployment hosts [13:01:38] not sure either [13:01:41] @Lucas_WMDE Where can I see the PHP notices?
[13:01:42] I guess I could scap pull on mwmaint [13:02:00] MichaelG_WMF: I see them in logspam-watch on mwlog2002 [13:02:08] presumably they’re also in logstash somewhere [13:02:14] but maybe not in mediawiki-errors, not sure [13:02:16] they’re from BackupDumper [13:02:29] anyway, let’s start with Daimona [13:02:35] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079518 (https://phabricator.wikimedia.org/T376695) (owner: 10Daimona Eaytoy) [13:02:50] out of curiosity, why are so many wikis getting CampaignEvents enabled recently? is it a new extension? ^^ [13:03:12] We've just gotten a lot of traction [13:03:18] (03Merged) 10jenkins-bot: [uawikimedia] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079518 (https://phabricator.wikimedia.org/T376695) (owner: 10Daimona Eaytoy) [13:03:20] Lucas_WMDE: I just saw your comment [13:03:35] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1079518|[uawikimedia] Enable the CampaignEvents extension (T376695)]] [13:03:38] T376695: Enable CampaignEvents Extension on Wikimedia Ukraine's wiki [Oct 14] - https://phabricator.wikimedia.org/T376695 [13:03:43] Nemoralis: yeah, I was just looking at it before the window started [13:03:45] I somehow thought I had forgotten it [13:03:46] ^that. 
It's somewhat new but not really new [13:03:52] ok ^^ [13:04:05] then what could be the reason for the previous change not working [13:04:13] no idea :/ [13:04:18] (03CR) 10NMW03: "o_O" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079619 (https://phabricator.wikimedia.org/T362620) (owner: 10NMW03) [13:05:11] MichaelG_WMF: I see the fwrite notices in mediawiki-errors in logstash too fwiw [13:05:27] * Lucas_WMDE makes a task [13:05:51] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Backport for [[gerrit:1079518|[uawikimedia] Enable the CampaignEvents extension (T376695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:05:54] Lucas_WMDE: thanks, now I've also connected mwlog2002 and am looking at logspam watch [13:05:54] Lucas_WMDE: maybe because deployer didn't run the maintenance script [13:05:57] I remember that [13:06:41] (03CR) 10Elukey: [C:03+2] Remove aux-k8s-etcd100[1,2] from the AUX client SRV records [dns] - 10https://gerrit.wikimedia.org/r/1080016 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [13:06:41] it was Urbanecm iirc, he didn't run the maintenance script [13:06:55] * urbanecm was summoned [13:06:59] hi [13:07:05] Daimona: please test :) [13:08:13] Nemoralis: can you clarify what recent action of mine are you referring to? [13:08:28] LGTM @Daimona [13:08:38] urbanecm: we are talking about this patch [13:08:38] https://gerrit.wikimedia.org/r/q/Ie3906e3b67 [13:08:47] Yep LGTM too [13:09:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P69807 and previous config saved to /var/cache/conftool/dbconfig/20241014-130904-ladsgroup.json [13:09:15] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, daimona: Continuing with sync [13:09:18] ok, thanks! [13:09:21] Thank you! 
[13:09:26] filed T377136 for the logspam-watch fwrite errors btw [13:09:27] T377136: PHP Notice: fwrite(): write of X bytes failed with errno=32 Broken pipe - https://phabricator.wikimedia.org/T377136 [13:12:06] (03CR) 10Elukey: [C:03+2] Remove aux-k8s-etcd1001 from the AUX cluster's SRV records [dns] - 10https://gerrit.wikimedia.org/r/1080017 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [13:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:57] “SUCCESS in 53m 08s” ._. [13:13:07] does GrowthExperiments CI usually take this long or did that build get unlucky? [13:13:13] anyway, let’s +2 those backports already [13:13:25] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "start gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079923 (owner: 10Michael Große) [13:13:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "start gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079894 (owner: 10Michael Große) [13:13:35] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "start gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079915 (https://phabricator.wikimedia.org/T364341) (owner: 10Michael Große) [13:13:42] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "start gate-and-submit ahead of deployment" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079925 (https://phabricator.wikimedia.org/T373176) (owner: 10Michael Große) [13:13:54] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079518|[uawikimedia] Enable the CampaignEvents extension (T376695)]] 
(duration: 10m 19s) [13:13:59] T376695: Enable CampaignEvents Extension on Wikimedia Ukraine's wiki [Oct 14] - https://phabricator.wikimedia.org/T376695 [13:14:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P69808 and previous config saved to /var/cache/conftool/dbconfig/20241014-131411-ladsgroup.json [13:14:19] Lucas_WMDE: I think that build was particularly unlucky, but it does take a long time [13:14:23] ok [13:14:31] (03CR) 10Elukey: [C:03+2] Remove aux-k8s-etcd1002 from the AUX cluster's SRV records [dns] - 10https://gerrit.wikimedia.org/r/1080018 (https://phabricator.wikimedia.org/T344230) (owner: 10Elukey) [13:14:43] Lucas_WMDE: just checked the irc archive, yes we didn't run the maintenance script [13:14:53] is it required? I don't remember [13:14:57] running the phpunit tests in parallel really makes a big difference for GrowthExperiments, hopefully this will be widely available soon [13:15:27] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: about to decom [13:15:31] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: about to decom [13:15:52] also, I am not sure if mediawiki recognizes the different quote on namespace [13:16:13] Nemoralis: I just looked at the archive too (https://wm-bot.wmcloud.org/browser/index.php?start=04%2F24%2F2024&end=04%2F24%2F2024&display=%23wikimedia-operations) [13:16:15] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1002.eqiad.wmnet with reason: about to decom [13:16:23] it is Vikilug‘at => Vikilugʻat [13:16:23] it’s not really clear what didn’t work in the first place IMHO [13:16:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1002.eqiad.wmnet with reason: about to decom [13:16:40] you just 
said something didn’t work and then it was reverted [13:17:47] let me try to find another config change that did the same for another wiki [13:17:48] for comparison [13:18:09] I said "didn't work" in the sense that I see the old namespace instead of new [13:18:09] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts aux-k8s-etcd1001.eqiad.wmnet [13:18:32] you can check https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1076254 [13:18:34] (03CR) 10Jcrespo: "Funny how the tables have turned :-) - but please use proper sql argument interpolation on query, not fstrings (provide a tuple). Will do " [cookbooks] - 10https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: 10Volans) [13:18:36] it is same thing [13:18:44] (03CR) 10Arnaudb: [C:03+2] mariadb: add data directory accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: 10Arnaudb) [13:19:24] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:28] (03PS2) 10Arnaudb: mariadb: get systemd status for instance [software/spicerack] - 10https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) [13:20:19] (03PS1) 10Elukey: Remove aux-k8s-etcd100[1,2] from production [puppet] - 10https://gerrit.wikimedia.org/r/1080022 (https://phabricator.wikimedia.org/T344230) [13:21:18] (03CR) 10Volans: "@jcrespo@wikimedia.org as we're running the query via ssh on the CLI we don't have a client able to perform proper interpolation. 
What wou" [cookbooks] - 10https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) (owner: 10Volans) [13:21:55] (03CR) 10Jcrespo: [C:04-1] "Thank you elukey, I thought the lib would only provide url parsing, not relative url resolution, but you showed me that it does. Amending." [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: 10Jcrespo) [13:22:40] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [13:23:28] (03PS5) 10Hashar: contint: define component/ci only once [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) [13:23:28] (03CR) 10Hashar: "I was waiting for the parent change to merge in order to review the catalogue compilation. The only oddity I found was `Package[jenkins]` " [puppet] - 10https://gerrit.wikimedia.org/r/1074468 (https://phabricator.wikimedia.org/T375278) (owner: 10Hashar) [13:24:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2227 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69809 and previous config saved to /var/cache/conftool/dbconfig/20241014-132409-ladsgroup.json [13:25:00] Nemoralis: if I understand correctly, existing pages should have shown the new namespace name without the maintenance script [13:25:09] (though you might have to access them via ?curid=) [13:26:04] yes I said "didn't work" because I saw the old namespace instead of new [13:26:09] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aux-k8s-etcd1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [13:26:16] yeah… then I don’t know what we would need to fix [13:26:22] since it sounds like it wasn’t just the maintenance script missing [13:26:28] (03PS1) 10Ammarpad: contactpages: Move stewards contactpage to MetaContactPages.php [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1080023 [13:26:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aux-k8s-etcd1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [13:26:28] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:26:29] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts aux-k8s-etcd1001.eqiad.wmnet [13:26:51] the code is same as https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1076254 [13:26:54] !log elukey@cumin1002 START - Cookbook sre.hosts.decommission for hosts aux-k8s-etcd1002.eqiad.wmnet [13:26:54] I am not sure either [13:26:59] (03PS2) 10Ammarpad: contactpages: Move stewards contactpage to MetaContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080023 [13:27:21] “(though you might have to access them via ?curid=)” – disregard that part, that’s what the namespace alias is for, nevermind [13:27:31] I think we have to skip this config change then :/ [13:27:40] (03CR) 10CI reject: [V:04-1] contactpages: Move stewards contactpage to MetaContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080023 (owner: 10Ammarpad) [13:27:49] ok, I will recheck it then [13:28:22] good luck… [13:28:34] (03CR) 10FNegri: alertmanager: fix WMCS template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [13:28:45] (03PS1) 10Elukey: admin_ng: remove ad-hoc anti-affinity rules for Calico typha in AUX [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080024 (https://phabricator.wikimedia.org/T333302) [13:28:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079923 (owner: 10Michael Große) 
[13:28:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079894 (owner: 10Michael Große) [13:28:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079915 (https://phabricator.wikimedia.org/T364341) (owner: 10Michael Große) [13:28:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079925 (https://phabricator.wikimedia.org/T373176) (owner: 10Michael Große) [13:29:15] (03PS7) 10FNegri: alertmanager: fix WMCS template [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) [13:29:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T376905)', diff saved to https://phabricator.wikimedia.org/P69810 and previous config saved to /var/cache/conftool/dbconfig/20241014-132918-ladsgroup.json [13:29:21] (03PS3) 10Ammarpad: contactpages: Move stewards contactpage to MetaContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080023 [13:29:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:29:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [13:29:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T376905)', diff saved to https://phabricator.wikimedia.org/P69811 and previous config saved to /var/cache/conftool/dbconfig/20241014-132944-ladsgroup.json [13:30:35] (03Merged) 10jenkins-bot: mariadb: add data directory accessor [software/spicerack] - 
10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: 10Arnaudb) [13:31:26] !log elukey@cumin1002 START - Cookbook sre.dns.netbox [13:33:03] (03PS4) 10Jcrespo: check footer legal complience: Add support for relative URLs [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) [13:33:36] (03PS6) 10Arnaudb: mysql_legacy: double quote escape in run_query [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) [13:33:52] (03CR) 10CI reject: [V:04-1] check footer legal complience: Add support for relative URLs [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: 10Jcrespo) [13:34:09] (03CR) 10Jcrespo: "used urllib.parse.urljoin()" [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: 10Jcrespo) [13:34:17] (03CR) 10Arnaudb: mysql_legacy: double quote escape in run_query (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: 10Arnaudb) [13:34:49] !log elukey@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aux-k8s-etcd1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [13:34:50] (03PS5) 10Jcrespo: check footer legal complience: Add support for relative URLs [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) [13:35:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: aux-k8s-etcd1002.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - elukey@cumin1002" [13:35:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:35:05] !log elukey@cumin1002 END (PASS) - 
Cookbook sre.hosts.decommission (exit_code=0) for hosts aux-k8s-etcd1002.eqiad.wmnet [13:35:35] jouncebot: nowandnext [13:35:35] For the next 0 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1300) [13:35:35] In 1 hour(s) and 54 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1530) [13:35:54] 10SRE-tools, 06Data-Persistence-SRE, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: mysql_legacy data_directory getter - https://phabricator.wikimedia.org/T376701#10225935 (10ABran-WMF) 05In progress→03Resolved [13:36:18] (03PS6) 10Jcrespo: check footer legal complience: Add support for relative URLs [puppet] - 10https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) [13:36:18] Amir1: I’m waiting for CI to finish on some backports [13:37:00] (03PS4) 10Ammarpad: contactpages: Move stewards contactpage to MetaContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080023 [13:37:01] thanks. 
Please let me know once you're done, if you can squeeze https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1079984 that'd be amazing [13:37:09] (it should be fast) [13:37:18] zuul says ETA 10 minutes [13:37:27] I think you could try to squeeze that in now [13:37:36] if I ctrl+c my scap backport [13:37:56] thanks [13:38:04] alright, I’ve ctrl+c’ed mine [13:38:06] go ahead [13:38:10] (03PS1) 10Brouberol: cloudnative_pg: monitor daily rclone sync of PG S3 buckets [alerts] - 10https://gerrit.wikimedia.org/r/1080027 (https://phabricator.wikimedia.org/T377112) [13:38:14] and the CI can just continue [13:38:15] (03CR) 10Ladsgroup: [C:03+2] Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079984 (owner: 10Ladsgroup) [13:38:22] ETA 10 minutes is not realistic, more like 20 minutes [13:38:25] ok [13:38:25] sadly [13:38:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079984 (owner: 10Ladsgroup) [13:39:15] (03Merged) 10jenkins-bot: Update interwiki.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079984 (owner: 10Ladsgroup) [13:39:16] (Where are those zuul estimates coming from anyway? 
They seem so often way too low for some repositories) [13:39:21] (03PS1) 10Giuseppe Lavagetto: hiddenparma: fix whitespace in hiddenparma-default.erb [puppet] - 10https://gerrit.wikimedia.org/r/1080028 [13:39:29] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1079984|Update interwiki.php]] [13:40:38] (03CR) 10Giuseppe Lavagetto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4301/co" [puppet] - 10https://gerrit.wikimedia.org/r/1080028 (owner: 10Giuseppe Lavagetto) [13:40:57] no idea [13:41:27] (03CR) 10Giuseppe Lavagetto: [V:03+1 C:03+2] hiddenparma: fix whitespace in hiddenparma-default.erb [puppet] - 10https://gerrit.wikimedia.org/r/1080028 (owner: 10Giuseppe Lavagetto) [13:41:43] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1079984|Update interwiki.php]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:41:49] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [13:44:17] (03CR) 10CI reject: [V:04-1] mysql_legacy: double quote escape in run_query [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: 10Arnaudb) [13:44:21] !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@fbcf880]: T375480 [13:44:25] T375480: ETL pipeline for Automoderator monthly key metrics - https://phabricator.wikimedia.org/T375480 [13:44:47] (03PS7) 10Arnaudb: mysql_legacy: double quote escape in run_query [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) [13:45:24] !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@fbcf880]: T375480 (duration: 01m 07s) [13:46:29] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079984|Update interwiki.php]] (duration: 07m 00s) [13:47:33] Amir1: all done? 
[13:47:50] (03PS1) 10Giuseppe Lavagetto: wikimedia.org: add requestctl [dns] - 10https://gerrit.wikimedia.org/r/1080029 (https://phabricator.wikimedia.org/T371782) [13:48:28] yup [13:48:35] ok, resuming my scap then [13:48:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079923 (owner: 10Michael Große) [13:48:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079894 (owner: 10Michael Große) [13:48:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079915 (https://phabricator.wikimedia.org/T364341) (owner: 10Michael Große) [13:48:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079925 (https://phabricator.wikimedia.org/T373176) (owner: 10Michael Große) [13:49:12] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10225985 (10cmooney) We definitely want to use DHCPv6 (stateful) for address assignment. So OpenStack is in control of what IPs are us... 
[13:49:28] (03CR) 10Giuseppe Lavagetto: [C:03+2] wikimedia.org: add requestctl [dns] - 10https://gerrit.wikimedia.org/r/1080029 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [13:50:47] 14SRE-Sprint-Week-Sustainability-March2023, 06serviceops, 10Sustainability (Incident Followup): Expand upon Kask/Sessionstore documentation - https://phabricator.wikimedia.org/T320398#10225997 (10hnowlan) a:05hnowlan→03None [13:56:26] (03PS7) 10Brouberol: Define the ceph-csi-cephfs admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077878 (https://phabricator.wikimedia.org/T376406) [13:56:26] (03PS1) 10Brouberol: ceph-csi-cephfs: replace the ClusterRole by a list of ns-scoped Roles [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080032 (https://phabricator.wikimedia.org/T376406) [13:58:13] !log stevemunene@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-worker1176.eqiad.wmnet [13:58:36] (03Merged) 10jenkins-bot: refactor(tests): don't use per-method coverage annotation [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079923 (owner: 10Michael Große) [13:59:16] (03Merged) 10jenkins-bot: refactor(HomepageHooks): extract method for simpler modifyability [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079894 (owner: 10Michael Große) [13:59:29] almost there… [13:59:33] jouncebot: next [13:59:33] In 1 hour(s) and 30 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1530) [13:59:39] ok we have some time left [14:00:16] (03CR) 10Btullis: [C:03+1] "This looks good to me. Thanks. This addresses the issue of the CSI plugin having privileges that are too high across the cluster. 
Adding e" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080032 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [14:00:48] * MichaelG_WMF is twiddeling their thumbs looking at the last change that is still going [14:01:30] (03CR) 10Btullis: [C:03+1] "Nice. Thanks." [alerts] - 10https://gerrit.wikimedia.org/r/1080027 (https://phabricator.wikimedia.org/T377112) (owner: 10Brouberol) [14:01:38] 06SRE, 06cloud-services-team, 10Cloud-VPS, 06Infrastructure-Foundations, 10netops: openstack: initial IPv6 support in neutron - https://phabricator.wikimedia.org/T375847#10226051 (10cmooney) So.... maybe this is normal for DHCPv6? Re-reading the reddit post and looking at the setup on the VM it seems li... [14:02:11] (03PS1) 10Giuseppe Lavagetto: idp: add entry for requesctl.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1080034 (https://phabricator.wikimedia.org/T371782) [14:02:20] (03Merged) 10jenkins-bot: Clear LinkRecommendation suggestions on page save [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079915 (https://phabricator.wikimedia.org/T364341) (owner: 10Michael Große) [14:02:22] (03Merged) 10jenkins-bot: Run fixLinkRecommendationData even when disabled in CC [extensions/GrowthExperiments] (wmf/1.43.0-wmf.26) - 10https://gerrit.wikimedia.org/r/1079925 (https://phabricator.wikimedia.org/T373176) (owner: 10Michael Große) [14:02:29] wheeee [14:02:32] \o/ [14:02:42] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1079923|refactor(tests): don't use per-method coverage annotation]], [[gerrit:1079894|refactor(HomepageHooks): extract method for simpler modifyability]], [[gerrit:1079915|Clear LinkRecommendation suggestions on page save (T364341 T372337)]], [[gerrit:1079925|Run fixLinkRecommendationData even when disabled in CC (T373176)]] [14:02:48] T364341: [wmf.3] Special:Homepage - high rate for dangling db records - https://phabricator.wikimedia.org/T364341 
[14:02:49] T372337: High number of dangling search index results at fr.wikipedia or it.wikipedia - https://phabricator.wikimedia.org/T372337 [14:02:49] T373176: fixLinkRecommendationData.php does not run when link-recommendation task type is disabled - https://phabricator.wikimedia.org/T373176 [14:03:14] !log stevemunene@cumin1002 START - Cookbook sre.dns.netbox [14:03:33] (03CR) 10Btullis: "Thanks elukey. This has now been addressed in: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1080032" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077872 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [14:04:41] (03CR) 10Btullis: [C:03+1] "I'm happy with this. We might also want to replicate this functionality in the RBD version and/or send this upstream." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078387 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [14:04:49] !log lucaswerkmeister-wmde@deploy2002 migr, lucaswerkmeister-wmde: Backport for [[gerrit:1079923|refactor(tests): don't use per-method coverage annotation]], [[gerrit:1079894|refactor(HomepageHooks): extract method for simpler modifyability]], [[gerrit:1079915|Clear LinkRecommendation suggestions on page save (T364341 T372337)]], [[gerrit:1079925|Run fixLinkRecommendationData even when disabled in CC (T373176)]] synced to [14:04:49] the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:04:57] !log lucaswerkmeister-wmde@deploy2002 migr, lucaswerkmeister-wmde: Continuing with sync [14:06:10] nothing should change from these patches, so if the logs are fine, we can move forward I think. 
[14:06:31] (this will take a config change to toggle the feature flag for there to be a change over time) [14:06:41] !log stevemunene@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1176.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [14:07:06] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1176.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [14:07:06] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:07:07] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-worker1176.eqiad.wmnet [14:07:51] !log stevemunene@cumin1002 START - Cookbook sre.hosts.decommission for hosts an-worker1177.eqiad.wmnet [14:09:31] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1079923|refactor(tests): don't use per-method coverage annotation]], [[gerrit:1079894|refactor(HomepageHooks): extract method for simpler modifyability]], [[gerrit:1079915|Clear LinkRecommendation suggestions on page save (T364341 T372337)]], [[gerrit:1079925|Run fixLinkRecommendationData even when disabled in CC (T373176)]] (duration: 0 [14:09:31] 6m 48s) [14:09:37] T364341: [wmf.3] Special:Homepage - high rate for dangling db records - https://phabricator.wikimedia.org/T364341 [14:09:37] T372337: High number of dangling search index results at fr.wikipedia or it.wikipedia - https://phabricator.wikimedia.org/T372337 [14:09:37] T373176: fixLinkRecommendationData.php does not run when link-recommendation task type is disabled - https://phabricator.wikimedia.org/T373176 [14:10:02] !log [untruncated duration: 06m 48s] [14:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:12:06] !log UTC afternoon backport+config window done [14:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:26] (03CR) 10Brouberol: [C:03+2] cloudnative_pg: monitor daily rclone sync of PG S3 buckets [alerts] - 10https://gerrit.wikimedia.org/r/1080027 (https://phabricator.wikimedia.org/T377112) (owner: 10Brouberol) [14:12:27] Lucas_WMDE: Thank you! 🙏 [14:12:35] np :) [14:12:51] !log stevemunene@cumin1002 START - Cookbook sre.dns.netbox [14:16:12] !log stevemunene@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1177.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [14:16:33] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1177.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - stevemunene@cumin1002" [14:16:33] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:16:34] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-worker1177.eqiad.wmnet [14:18:35] (03CR) 10Gmodena: [C:03+1] Renamed log fields for pipeline migration (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [14:18:40] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Enable by default when profile is included [puppet] - 10https://gerrit.wikimedia.org/r/1080038 (https://phabricator.wikimedia.org/T362408) [14:21:46] (03PS1) 10Michael Große: eswiki: switch clearing link recommendations to PageSaveComplete hook [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080035 (https://phabricator.wikimedia.org/T372337) [14:21:46] (03CR) 10Michael Große: "I'm a bit unsure about whether we should do this in one change (like here) or in two changes where the 
first one only adds the default, an" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080035 (https://phabricator.wikimedia.org/T372337) (owner: 10Michael Große) [14:22:08] (03PS1) 10Brouberol: flink-operator: deply an image with fixes for recent OpenJDK vulnerability fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080039 (https://phabricator.wikimedia.org/T371874) [14:23:44] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4302/console" [puppet] - 10https://gerrit.wikimedia.org/r/1080038 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:29:13] (03CR) 10David Caro: [C:03+1] "LGTM, let's try" [puppet] - 10https://gerrit.wikimedia.org/r/1077038 (https://phabricator.wikimedia.org/T375479) (owner: 10FNegri) [14:30:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T376905)', diff saved to https://phabricator.wikimedia.org/P69812 and previous config saved to /var/cache/conftool/dbconfig/20241014-143000-ladsgroup.json [14:31:34] (03CR) 10Brouberol: [C:03+1] Remove the dumps_store_load_average icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1079971 (https://phabricator.wikimedia.org/T374821) (owner: 10Btullis) [14:33:51] (03PS2) 10Brouberol: flink-operator: deply an image with fixes for recent OpenJDK vulns [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080039 (https://phabricator.wikimedia.org/T371874) [14:34:04] (03CR) 10Elukey: "Anything missing Janis? 
We are now fully row redundant :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080024 (https://phabricator.wikimedia.org/T333302) (owner: 10Elukey) [14:34:44] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10226204 (10phaultfinder) [14:35:57] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Refactor docker integration [puppet] - 10https://gerrit.wikimedia.org/r/1080042 (https://phabricator.wikimedia.org/T362408) [14:37:13] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:26] (03CR) 10JMeybohm: [C:03+1] "I don't think there was anything else. 🚢" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1080024 (https://phabricator.wikimedia.org/T333302) (owner: 10Elukey) [14:38:13] (03CR) 10JMeybohm: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1080042 (https://phabricator.wikimedia.org/T362408) (owner: 10JMeybohm) [14:39:02] (03PS1) 10Tiziano Fogli: logstash: stripping containerd prefix [puppet] - 10https://gerrit.wikimedia.org/r/1080047 (https://phabricator.wikimedia.org/T377132) [14:39:11] !log jayme@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[14:40:00] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10226221 (10elukey) p:05Triage→03Medium [14:40:16] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Upload redfish licenses to supermicro hosts - https://phabricator.wikimedia.org/T376121#10226222 (10elukey) p:05Triage→03Medium [14:40:57] (03CR) 10CI reject: [V:04-1] logstash: stripping containerd prefix [puppet] - 10https://gerrit.wikimedia.org/r/1080047 (https://phabricator.wikimedia.org/T377132) (owner: 10Tiziano Fogli) [14:41:08] !log jayme@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:41:39] !log jayme@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:42:38] (03PS2) 10Tiziano Fogli: logstash: stripping containerd prefix [puppet] - 10https://gerrit.wikimedia.org/r/1080047 (https://phabricator.wikimedia.org/T377132) [14:43:02] !log jayme@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:43:51] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [14:45:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P69813 and previous config saved to /var/cache/conftool/dbconfig/20241014-144507-ladsgroup.json [14:45:30] 10SRE-tools, 06Infrastructure-Foundations: debmonitor could provide users with cumin and/or debdeploy pre-made config/command - https://phabricator.wikimedia.org/T375475#10226275 (10joanna_borun) p:05Triage→03Low [14:45:36] 10SRE-tools, 06Infrastructure-Foundations: debmonitor could provide users with cumin and/or debdeploy pre-made config/command - https://phabricator.wikimedia.org/T375475#10226274 (10joanna_borun) @fgiunchedi could you please expand on the use-case and problem so we can figure out best way to address it? 
[14:46:50] SRE, SRE-Unowned, Infrastructure-Foundations: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014#10226280 (elukey) p:Triage→Medium [14:48:42] SRE, Infrastructure-Foundations: Phase out platform-engineering POSIX group - https://phabricator.wikimedia.org/T376808#10226299 (elukey) p:Triage→Medium [14:50:11] (CR) Majavah: "yeah, let's do that please while we still can. I'll send a patch." [puppet] - https://gerrit.wikimedia.org/r/1078986 (https://phabricator.wikimedia.org/T362066) (owner: David Caro) [14:50:14] (PS3) Tiziano Fogli: logstash: stripping containerd prefix [puppet] - https://gerrit.wikimedia.org/r/1080047 (https://phabricator.wikimedia.org/T377132) [14:51:58] SRE-OnFire, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Juniper: regularly run `request system configuration rescue save` - https://phabricator.wikimedia.org/T376005#10226302 (joanna_borun) p:Triage→Low a:ayounsi [14:55:45] (PS1) Ilias Sarantopoulos: ml-services: bump kserve in langid to 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1080052 (https://phabricator.wikimedia.org/T367048) [15:00:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P69814 and previous config saved to /var/cache/conftool/dbconfig/20241014-150014-ladsgroup.json [15:00:44] (CR) Urbanecm: [C:+1] "Both are equally acceptable!
Personally, I believe it is wise to "pin" all/most variables in operations/mediawiki-config the moment you ad" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080035 (https://phabricator.wikimedia.org/T372337) (owner: Michael Große) [15:02:04] (PS1) Majavah: P:toolforge::proxy: use svc.toolforge.org [puppet] - https://gerrit.wikimedia.org/r/1080056 [15:02:07] (CR) AikoChou: [C:+1] ml-services: bump kserve in langid to 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1080052 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos) [15:02:13] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:02] (CR) Ilias Sarantopoulos: [C:+2] ml-services: bump kserve in langid to 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1080052 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos) [15:03:28] (CR) Elukey: [C:+2] Remove aux-k8s-etcd100[1,2] from production [puppet] - https://gerrit.wikimedia.org/r/1080022 (https://phabricator.wikimedia.org/T344230) (owner: Elukey) [15:03:59] (CR) CI reject: [V:-1] P:toolforge::proxy: use svc.toolforge.org [puppet] - https://gerrit.wikimedia.org/r/1080056 (owner: Majavah) [15:04:10] (Merged) jenkins-bot: ml-services: bump kserve in langid to 0.13.1 [deployment-charts] - https://gerrit.wikimedia.org/r/1080052 (https://phabricator.wikimedia.org/T367048) (owner: Ilias Sarantopoulos) [15:04:20] (CR) Elukey: [C:+1] idp: add entry for requesctl.wikimedia.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1080034 (https://phabricator.wikimedia.org/T371782) (owner: Giuseppe Lavagetto) [15:04:37] (PS2) Majavah: P:toolforge::proxy: use svc.toolforge.org [puppet] - 
https://gerrit.wikimedia.org/r/1080056 [15:04:48] (CR) Elukey: [C:+2] admin_ng: remove ad-hoc anti-affinity rules for Calico typha in AUX [deployment-charts] - https://gerrit.wikimedia.org/r/1080024 (https://phabricator.wikimedia.org/T333302) (owner: Elukey) [15:05:30] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [15:06:07] !log elukey@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [15:07:37] !log elukey@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [15:15:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T376905)', diff saved to https://phabricator.wikimedia.org/P69815 and previous config saved to /var/cache/conftool/dbconfig/20241014-151521-ladsgroup.json [15:15:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:15:39] !log isaranto@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [15:15:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [15:15:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T376905)', diff saved to https://phabricator.wikimedia.org/P69816 and previous config saved to /var/cache/conftool/dbconfig/20241014-151546-ladsgroup.json [15:16:02] !log isaranto@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [15:19:24] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:23:56] (CR) Volans: "forgot to git add them?"
[software/spicerack] - https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: Arnaudb) [15:30:04] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1530). [15:33:28] (CR) Volans: "Nice, couple of small improvements inline" [software/spicerack] - https://gerrit.wikimedia.org/r/1080019 (https://phabricator.wikimedia.org/T377129) (owner: Arnaudb) [15:34:24] (PS8) Arnaudb: mysql_legacy: double quote escape in run_query [software/spicerack] - https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) [15:36:16] (CR) Volans: [C:+1] "It seems ok, give it a test before/after merging." [cookbooks] - https://gerrit.wikimedia.org/r/1080012 (owner: Ayounsi) [15:39:13] (PS2) JMeybohm: dragonfly::dfdaemon: Refactor docker integration [puppet] - https://gerrit.wikimedia.org/r/1080042 (https://phabricator.wikimedia.org/T362408) [15:43:15] (CR) Arnaudb: "I removed them and forgot to write a new set, its fixed!" [software/spicerack] - https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: Arnaudb) [15:46:39] (Abandoned) Elukey: role::docker_registry_ha::registry: add nginx monitoring [puppet] - https://gerrit.wikimedia.org/r/1077933 (https://phabricator.wikimedia.org/T376285) (owner: Elukey) [15:46:41] !log aikochou@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . [15:46:52] (Abandoned) Elukey: profile::trafficserver::backend: change timeouts for the docker registry [puppet] - https://gerrit.wikimedia.org/r/1075528 (https://phabricator.wikimedia.org/T242604) (owner: Elukey) [15:52:49] !log aikochou@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[16:03:52] !log Running `sgimeno@mwmaint2002:~$ foreachwiki userOptions.php --delete --old=1 growthexperiments-tour-newimpact-discovery` (T376461) [16:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:13] T376461: Remove unused user property growthexperiments-tour-newimpact-discovery - https://phabricator.wikimedia.org/T376461 [16:06:40] ops-eqiad, SRE, SRE-swift-storage, Ceph, DC-Ops: Disk (sdk) failed on moss-be1002 - https://phabricator.wikimedia.org/T377154 (MatthewVernon) NEW [16:07:03] ops-eqiad, SRE, SRE-swift-storage, Ceph, DC-Ops: Disk (sdk) failed on moss-be1002 - https://phabricator.wikimedia.org/T377154#10226646 (MatthewVernon) p:Triage→Medium [16:16:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T376905)', diff saved to https://phabricator.wikimedia.org/P69817 and previous config saved to /var/cache/conftool/dbconfig/20241014-161602-ladsgroup.json [16:16:51] (PS1) JMeybohm: containerd: Remove container log line length limit [puppet] - https://gerrit.wikimedia.org/r/1080071 (https://phabricator.wikimedia.org/T377132) [16:17:19] (CR) Volans: mysql_legacy: double quote escape in run_query (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: Arnaudb) [16:18:41] (PS1) Ammarpad: ContactPage: Move nlwiki contactpage config to CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1080072 (https://phabricator.wikimedia.org/T142544) [16:19:23] (CR) CI reject: [V:-1] ContactPage: Move nlwiki contactpage config to CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1080072 (https://phabricator.wikimedia.org/T142544) (owner: Ammarpad) [16:23:29] (PS2) Ammarpad: ContactPage: Move nlwiki contactpage config to CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1080072 
(https://phabricator.wikimedia.org/T142544) [16:31:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P69818 and previous config saved to /var/cache/conftool/dbconfig/20241014-163109-ladsgroup.json [16:36:03] (CR) Ammarpad: "Test plan: Check these forms work" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080072 (https://phabricator.wikimedia.org/T142544) (owner: Ammarpad) [16:46:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P69819 and previous config saved to /var/cache/conftool/dbconfig/20241014-164616-ladsgroup.json [16:50:48] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223 [16:50:51] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudvirt1063.eqiad.wmnet with reason: cloudvirt1063 needs maintenance T375223 [16:51:29] T375223: 2024-09-21 NodeDown cloudvirt1063 - https://phabricator.wikimedia.org/T375223 [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1700) [17:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T1700).
[17:01:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T376905)', diff saved to https://phabricator.wikimedia.org/P69820 and previous config saved to /var/cache/conftool/dbconfig/20241014-170123-ladsgroup.json [17:01:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [17:01:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [17:03:36] ops-eqiad, SRE, DC-Ops, Wikimedia-Portals, Patch-For-Review: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10226824 (simon04) Likely a regression of https://gerrit.wikimedia.org/r/c/operations/puppet/+/81... [17:06:27] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [17:06:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [17:06:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T376905)', diff saved to https://phabricator.wikimedia.org/P69821 and previous config saved to /var/cache/conftool/dbconfig/20241014-170647-ladsgroup.json [17:12:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:48:38] (CR) Ammarpad: "No default form should show for:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1080072 (https://phabricator.wikimedia.org/T142544) (owner: Ammarpad) [18:00:57] (PS7) Jcrespo: check footer legal complience: Add support for relative URLs [puppet] - 
https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) [18:01:50] (CR) Jcrespo: [C:+1] "Tests in this version look ok:" [puppet] - https://gerrit.wikimedia.org/r/1079973 (https://phabricator.wikimedia.org/T375789) (owner: Jcrespo) [18:07:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T376905)', diff saved to https://phabricator.wikimedia.org/P69822 and previous config saved to /var/cache/conftool/dbconfig/20241014-180704-ladsgroup.json [18:22:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P69823 and previous config saved to /var/cache/conftool/dbconfig/20241014-182211-ladsgroup.json [18:37:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P69824 and previous config saved to /var/cache/conftool/dbconfig/20241014-183718-ladsgroup.json [18:39:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10226992 (phaultfinder) [18:47:30] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@a1a70ce]: Deploy last fixes on Refine staging [airflow-dags@a1a70ce8] [18:47:43] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@a1a70ce]: Deploy last fixes on Refine staging [airflow-dags@a1a70ce8] (duration: 00m 13s) [18:52:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T376905)', diff saved to https://phabricator.wikimedia.org/P69825 and previous config saved to /var/cache/conftool/dbconfig/20241014-185225-ladsgroup.json [18:52:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [18:52:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: 
Maintenance [18:57:29] !log aqu@deploy2002 Started deploy [airflow-dags/analytics@a1a70ce]: Deploy last version for Refine staging [airflow-dags@a1a70ce8] [18:57:59] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics@a1a70ce]: Deploy last version for Refine staging [airflow-dags@a1a70ce8] (duration: 00m 29s) [19:19:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:25:20] (PS1) Pppery: Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1080078 [19:26:01] (CR) CI reject: [V:-1] Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1080078 (owner: Pppery) [19:27:03] (PS2) Pppery: Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1080078 [19:27:11] (PS1) Gerrit maintenance bot: mariadb: Promote db2209 to s3 master [puppet] - https://gerrit.wikimedia.org/r/1080079 (https://phabricator.wikimedia.org/T377164) [19:27:15] (PS1) Gerrit maintenance bot: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/1080080 (https://phabricator.wikimedia.org/T377164) [19:29:16] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [19:29:30] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [19:29:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:29:49] !log ladsgroup@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [19:29:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T376905)', diff saved to https://phabricator.wikimedia.org/P69826 and previous config saved to /var/cache/conftool/dbconfig/20241014-192956-ladsgroup.json [19:30:18] (PS3) Pppery: Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) [19:33:54] (PS4) Pppery: Configure namespaces, sitenames, and timezones for new wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1080078 (https://phabricator.wikimedia.org/T377160) [19:39:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T376905)', diff saved to https://phabricator.wikimedia.org/P69827 and previous config saved to /var/cache/conftool/dbconfig/20241014-193918-ladsgroup.json [19:54:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P69828 and previous config saved to /var/cache/conftool/dbconfig/20241014-195425-ladsgroup.json [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T2000). [20:00:05] Pppery: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:11] here [20:05:09] Anyone here to deploy?
[20:08:43] I can :) [20:08:44] one moment [20:09:30] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: Pppery) [20:09:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P69829 and previous config saved to /var/cache/conftool/dbconfig/20241014-200932-ladsgroup.json [20:10:13] (Merged) jenkins-bot: Missing.php: Redirect Scots Wiktionary to Scots Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: Pppery) [20:10:31] !log samtar@deploy2002 Started scap sync-world: Backport for [[gerrit:1078122|Missing.php: Redirect Scots Wiktionary to Scots Wikipedia (T249648)]] [20:10:41] T249648: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 [20:12:49] !log samtar@deploy2002 samtar, pppery: Backport for [[gerrit:1078122|Missing.php: Redirect Scots Wiktionary to Scots Wikipedia (T249648)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:13:00] Pppery: on mwdebug for testing [20:13:04] On it [20:13:34] See to work as I want it to [20:13:38] (seems) [20:14:03] !log samtar@deploy2002 samtar, pppery: Continuing with sync [20:14:27] Although my testing did re-reveal an unrelated bug which I already submitted a patch for [20:15:43] Pppery: that change ^ was okay to get deployed though, right?
[20:15:50] yes [20:16:24] since the bug happens regardless of whether or not it is deployed [20:16:33] ack :) [20:18:23] The bug being, by the way, that https://sco.wiktionary.org/wiki/w:Foo doesn't work - with the patch it goes to https://sco.wikipedia.org/wiki/Define:w:foo, and without the patch it goes to https://incubator.wikimedia.org/wiki/Wt/sco/w:foo, both of which are 404 pages hence neither is better than the other. The right destination would be [20:18:24] https://sco.wikipedia.org/wiki/Foo, and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1075957 will fix it [20:18:46] !log samtar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1078122|Missing.php: Redirect Scots Wiktionary to Scots Wikipedia (T249648)]] (duration: 08m 14s) [20:18:56] T249648: redirect sco.wiktionary.org/wiki/(.*?) -> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648 [20:19:55] deployed :) as for the follow-up, it'd be best to have a +1 rather than deploy it now, do you agree? [20:20:00] yep [20:20:10] which is why I haven't scheduled it for deployment [20:20:37] :) [20:21:32] SRE, Traffic-Icebox, Wikimedia-Apache-configuration, Patch-For-Review, Wiki-Setup (Delete / Redirect): redirect sco.wiktionary.org/wiki/(.*?) 
-> sco.wikipedia.org/wiki/Define:$1 - https://phabricator.wikimedia.org/T249648#10227207 (Pppery) Open→Resolved a:Pppery [20:21:33] !log UTC late backport window done [20:21:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T376905)', diff saved to https://phabricator.wikimedia.org/P69830 and previous config saved to /var/cache/conftool/dbconfig/20241014-202439-ladsgroup.json [20:24:44] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [20:24:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [20:25:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T376905)', diff saved to https://phabricator.wikimedia.org/P69831 and previous config saved to /var/cache/conftool/dbconfig/20241014-202504-ladsgroup.json [20:25:28] TheresNoTime: check discord [20:34:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T376905)', diff saved to https://phabricator.wikimedia.org/P69832 and previous config saved to /var/cache/conftool/dbconfig/20241014-203416-ladsgroup.json [20:49:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P69833 and previous config saved to /var/cache/conftool/dbconfig/20241014-204923-ladsgroup.json [20:57:48] SRE-OnFire, Incident Tooling: corto: implement updating IRC topics and wikimediastatus.net - https://phabricator.wikimedia.org/T370785#10227246 (lmata) >>! In T370785#10211164, @Eevans wrote: > Q: Should this be a part of the MVP (i.e. Day 1), or saved for a subsequent iteration? Having this in a later...
[21:00:04] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241014T2100). [21:04:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P69834 and previous config saved to /var/cache/conftool/dbconfig/20241014-210430-ladsgroup.json [21:11:30] FIRING: ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:12:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:16:30] RESOLVED: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:19:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T376905)', diff saved to https://phabricator.wikimedia.org/P69835 and previous config saved to /var/cache/conftool/dbconfig/20241014-211937-ladsgroup.json [21:19:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [21:19:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: 
Maintenance [21:20:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T376905)', diff saved to https://phabricator.wikimedia.org/P69836 and previous config saved to /var/cache/conftool/dbconfig/20241014-212001-ladsgroup.json [21:29:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T376905)', diff saved to https://phabricator.wikimedia.org/P69837 and previous config saved to /var/cache/conftool/dbconfig/20241014-212922-ladsgroup.json [21:34:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P69838 and previous config saved to /var/cache/conftool/dbconfig/20241014-213453-ladsgroup.json [21:38:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance [21:38:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1226.eqiad.wmnet with reason: Maintenance [21:39:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T367856)', diff saved to https://phabricator.wikimedia.org/P69839 and previous config saved to /var/cache/conftool/dbconfig/20241014-213902-ladsgroup.json [21:39:06] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [21:44:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P69840 and previous config saved to /var/cache/conftool/dbconfig/20241014-214429-ladsgroup.json [21:45:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [21:45:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1236.eqiad.wmnet with reason: Maintenance [21:45:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T371742)', diff saved to 
https://phabricator.wikimedia.org/P69841 and previous config saved to /var/cache/conftool/dbconfig/20241014-214515-ladsgroup.json [21:45:19] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [21:49:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P69842 and previous config saved to /var/cache/conftool/dbconfig/20241014-214958-ladsgroup.json [21:51:44] (PS2) Andrea Denisse: grafana: Ensure grafana-loki service auto restarts after system updates [puppet] - https://gerrit.wikimedia.org/r/1080090 (https://phabricator.wikimedia.org/T377166) [21:51:44] (CR) Andrea Denisse: [V:+1] "PCC results: https://puppet-compiler.wmflabs.org/output/1080090/4305/" [puppet] - https://gerrit.wikimedia.org/r/1080090 (https://phabricator.wikimedia.org/T377166) (owner: Andrea Denisse) [21:59:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P69843 and previous config saved to /var/cache/conftool/dbconfig/20241014-215936-ladsgroup.json [22:01:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance [22:01:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1243.eqiad.wmnet with reason: Maintenance [22:01:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T370903)', diff saved to https://phabricator.wikimedia.org/P69844 and previous config saved to /var/cache/conftool/dbconfig/20241014-220134-ladsgroup.json [22:01:39] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:05:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 75%: Maint over', diff saved to 
https://phabricator.wikimedia.org/P69845 and previous config saved to /var/cache/conftool/dbconfig/20241014-220504-ladsgroup.json [22:10:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T371742)', diff saved to https://phabricator.wikimedia.org/P69846 and previous config saved to /var/cache/conftool/dbconfig/20241014-221008-ladsgroup.json [22:10:12] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:14:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T376905)', diff saved to https://phabricator.wikimedia.org/P69847 and previous config saved to /var/cache/conftool/dbconfig/20241014-221443-ladsgroup.json [22:14:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [22:15:01] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [22:15:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T376905)', diff saved to https://phabricator.wikimedia.org/P69848 and previous config saved to /var/cache/conftool/dbconfig/20241014-221508-ladsgroup.json [22:20:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2194 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P69849 and previous config saved to /var/cache/conftool/dbconfig/20241014-222009-ladsgroup.json [22:23:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T376905)', diff saved to https://phabricator.wikimedia.org/P69850 and previous config saved to /var/cache/conftool/dbconfig/20241014-222317-ladsgroup.json [22:25:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P69851 and previous config saved to 
/var/cache/conftool/dbconfig/20241014-222515-ladsgroup.json [22:38:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P69852 and previous config saved to /var/cache/conftool/dbconfig/20241014-223824-ladsgroup.json [22:40:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P69853 and previous config saved to /var/cache/conftool/dbconfig/20241014-224022-ladsgroup.json [22:43:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T370903)', diff saved to https://phabricator.wikimedia.org/P69854 and previous config saved to /var/cache/conftool/dbconfig/20241014-224311-ladsgroup.json [22:43:15] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [22:44:41] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227342 (phaultfinder) [22:50:48] (CR) Krinkle: [C:+1] Enable {{USERLANGUAGE}} on Commons and Meta [mediawiki-config] - https://gerrit.wikimedia.org/r/1079680 (https://phabricator.wikimedia.org/T4085) (owner: Tim Starling) [22:52:52] (CR) Krinkle: [C:+1] Enable {{USERLANGUAGE}} on Commons and Meta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1079680 (https://phabricator.wikimedia.org/T4085) (owner: Tim Starling) [22:53:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P69855 and previous config saved to /var/cache/conftool/dbconfig/20241014-225331-ladsgroup.json [22:55:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T371742)', diff saved to https://phabricator.wikimedia.org/P69856 and previous config saved to 
/var/cache/conftool/dbconfig/20241014-225528-ladsgroup.json [22:55:32] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [22:58:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P69857 and previous config saved to /var/cache/conftool/dbconfig/20241014-225818-ladsgroup.json [22:59:42] FIRING: Device rebooted: Alert for device ps1-e3-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:04:42] RESOLVED: Device rebooted: Device ps1-e3-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:08:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T376905)', diff saved to https://phabricator.wikimedia.org/P69858 and previous config saved to /var/cache/conftool/dbconfig/20241014-230838-ladsgroup.json [23:08:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [23:08:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [23:09:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T376905)', diff saved to https://phabricator.wikimedia.org/P69859 and previous config saved to /var/cache/conftool/dbconfig/20241014-230903-ladsgroup.json [23:09:31] FIRING: Device rebooted: Alert for device ps1-e2-eqiad.mgmt.eqiad.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:09:37] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227354 (phaultfinder) [23:13:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P69860 and
previous config saved to /var/cache/conftool/dbconfig/20241014-231328-ladsgroup.json [23:14:31] RESOLVED: Device rebooted: Device ps1-e2-eqiad.mgmt.eqiad.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [23:17:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T376905)', diff saved to https://phabricator.wikimedia.org/P69861 and previous config saved to /var/cache/conftool/dbconfig/20241014-231715-ladsgroup.json [23:19:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:24:42] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227362 (phaultfinder) [23:28:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T370903)', diff saved to https://phabricator.wikimedia.org/P69862 and previous config saved to /var/cache/conftool/dbconfig/20241014-232835-ladsgroup.json [23:28:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance [23:28:39] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [23:28:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2140.codfw.wmnet with reason: Maintenance [23:28:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2140 (T370903)', diff saved to https://phabricator.wikimedia.org/P69863 and previous config saved to /var/cache/conftool/dbconfig/20241014-232857-ladsgroup.json [23:32:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P69864 and
previous config saved to /var/cache/conftool/dbconfig/20241014-233222-ladsgroup.json [23:38:30] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1080096 [23:38:30] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1080096 (owner: TrainBranchBot) [23:47:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P69865 and previous config saved to /var/cache/conftool/dbconfig/20241014-234729-ladsgroup.json [23:49:49] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10227369 (phaultfinder) [23:53:42] (CR) Tim Starling: "You know that's 49 wikis. It should probably be discussed somewhere." [mediawiki-config] - https://gerrit.wikimedia.org/r/1079680 (https://phabricator.wikimedia.org/T4085) (owner: Tim Starling)