[00:00:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:02:35] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1059518 (owner: TrainBranchBot) [00:04:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:05:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:09:23] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:15:40] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:19:23] FIRING: [13x] 
SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:40] FIRING: [13x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:29:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:30:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:34:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:34:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:39:23] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:39:28] FIRING: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:44:23] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [00:44:23] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:49:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:50:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:55:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:58:05] FIRING: 
KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [00:59:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:00:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:04:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:05:40] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:23] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:15:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:19:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:22:45] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:24:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:25:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:29:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:30:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:34:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:23] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:39:23] FIRING: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:44:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:28] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:44:32] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:40] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:54:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:55:40] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:05:40] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:05:40] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:23] RESOLVED: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:28] FIRING: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:10:40] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:10:41] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:15:40] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:19:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:19:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:24:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:25:40] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:29:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:30:40] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:34:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:34:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:23] RESOLVED: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:39:23] FIRING: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:39:28] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:44:23] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:44:23] FIRING: [2x] JobUnavailable: 
Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:49:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:59:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:04:23] RESOLVED: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:09:23] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:09:23] FIRING: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:10:40] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:14:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:14:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:15:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:25:40] FIRING: [14x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:29:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:30:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:40] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:39:23] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:40:40] FIRING: [2x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:44:23] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:44:23] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:49:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:50:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:54:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:55:40] FIRING: [14x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:59:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:04:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:04:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:09:23] RESOLVED: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:09:23] FIRING: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:14:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:14:23] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:14:27] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:15:40] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:19:23] FIRING: [13x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:24:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:25:40] FIRING: [14x] SystemdUnitFailed: 
check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:30:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:34:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:39:23] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:39:23] FIRING: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:44:23] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:45:40] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:49:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:50:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:55:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:58:05] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [04:59:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[05:00:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:04:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:04:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:05:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:09:23] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:10:41] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:14:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:15:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:25:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:29:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:34:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on 
netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:34:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:40] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:23] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:23] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:45:40] FIRING: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:49:23] RESOLVED: [14x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[05:49:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:54:23] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:55:41] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:59:23] FIRING: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:00:40] RESOLVED: [4x] ProbeDown: Service netbox1003:443 has failed probes (http_netbox1003_eqiad_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#netbox1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:00:41] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:03:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production 
- https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:23] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:04:49] !log reboot netbox1003 [06:08:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:14:23] FIRING: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:17:16] !log bump netbox1003 memory to 6G [06:19:16] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netbox1003.eqiad.wmnet [06:19:23] FIRING: [10x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:20:41] RESOLVED: JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:25:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm 
(exit_code=0) for VM netbox1003.eqiad.wmnet [06:25:40] RESOLVED: [11x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:26:06] (03PS2) 10KartikMistry: Update MinT to 2024-08-05-062247-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057421 (https://phabricator.wikimedia.org/T363308) [06:29:10] If there is no config/backport patches in the next window, I would like to deploy MinT/cxserver. [06:40:23] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10041144 (10WMDECyn) Yes I do. [06:51:33] (03PS1) 10Slyngshede: data.yaml: Offboarding of mcastro [puppet] - 10https://gerrit.wikimedia.org/r/1059743 [06:55:39] !log push `LVS-service-ips` rename to ssw1-d8-codfw [06:55:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:10:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10041171 (10JoelyRooke-WMDE) Yes I will need access to private data, just not ssh key entry [07:12:32] I'll deploy MinT/cxserver after some time. 
[07:14:49] (03PS1) 10KartikMistry: Update cxserver to 2024-08-05-063332-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059746 (https://phabricator.wikimedia.org/T371760) [07:19:02] (03CR) 10Filippo Giunchedi: [C:03+1] jaeger: enable archive support in query and ui [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059390 (https://phabricator.wikimedia.org/T371390) (owner: 10CDanis) [07:19:52] (03CR) 10Filippo Giunchedi: [C:03+2] o11y: higher thresholds for webrequest-live benthos kafka lag alert [alerts] - 10https://gerrit.wikimedia.org/r/1059304 (owner: 10Filippo Giunchedi) [07:21:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 5.769s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [07:27:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:31:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 6.944s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [07:31:59] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3535/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [07:35:14] (03CR) 10Jelto: [V:03+1] "comment in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [07:36:54] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3536/console" [puppet] - 10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [07:40:07] (03CR) 10Filippo Giunchedi: [C:03+1] haproxy: remove template switch for benthos extended logging [puppet] - 10https://gerrit.wikimedia.org/r/1059358 (https://phabricator.wikimedia.org/T370741) (owner: 10Fabfur) [07:41:04] (03CR) 10Jelto: [V:03+1] "comment in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [07:41:34] (03CR) 10Filippo Giunchedi: "SGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: 10Filippo Giunchedi) [07:56:11] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10041263 
(10SLyngshede-WMF) p:05Triage→03Medium @KFrancis would you confirm that we have an NDA for @seanleong-WMDE [08:01:14] (03PS2) 10Jelto: gerrit: set nft throttling policy to drop, only on replica host [puppet] - 10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [08:01:52] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki "It'sMogli" 'ItsMogli' # T371784 [08:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:55] T371784: Unblock stuck global renames of Ligg89 and ItsMogli - https://phabricator.wikimedia.org/T371784 [08:02:55] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=loginwiki --logwiki=metawiki 'Lirielmartinss' 'Ligg89' # T371784 [08:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:51] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [08:05:03] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm now!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [08:07:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by zabe@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059402 (owner: 10Zabe) [08:07:51] (03Merged) 10jenkins-bot: noc: Provide db-sections.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059402 (owner: 10Zabe) [08:08:14] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1059402|noc: Provide db-sections.php]] [08:09:14] (03PS3) 10Jelto: gerrit: enable nft throttling on role level, but just log [puppet] - 10https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [08:11:10] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [08:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:53] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netbox1003.eqiad.wmnet [08:14:19] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3538/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [08:16:14] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm now" [puppet] - 10https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [08:17:56] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netbox1003.eqiad.wmnet [08:20:50] !log manually removing wmf_auto_restart_benthos@haproxy_cache.service on cp4037 - T370741 [08:20:54] !log zabe@deploy1003 zabe: Backport for [[gerrit:1059402|noc: Provide db-sections.php]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:57] !log 
zabe@deploy1003 zabe: Continuing with sync [08:20:57] T370741: Remove Benthos from ulsfo hosts - https://phabricator.wikimedia.org/T370741 [08:21:38] (03PS6) 10Ayounsi: check_netbox_report.py: reports -> scripts [puppet] - 10https://gerrit.wikimedia.org/r/1059042 [08:21:38] (03PS3) 10Ayounsi: Netbox add libpq-dev package [puppet] - 10https://gerrit.wikimedia.org/r/1059099 [08:21:38] (03PS3) 10Ayounsi: Netbox: enable netbox_more_metrics plugin [puppet] - 10https://gerrit.wikimedia.org/r/1059363 (https://phabricator.wikimedia.org/T311052) [08:21:39] (03PS1) 10Ayounsi: Postgres prom exporter: ignore extended queries on >= bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1059834 [08:22:29] (03CR) 10CI reject: [V:04-1] Postgres prom exporter: ignore extended queries on >= bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1059834 (owner: 10Ayounsi) [08:27:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:28:00] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM netbox2003.codfw.wmnet [08:28:45] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM netbox2003.codfw.wmnet [08:30:19] !log zabe@deploy1003 Finished scap: Backport for [[gerrit:1059402|noc: Provide db-sections.php]] (duration: 22m 04s) [08:42:13] (03PS2) 10Ayounsi: Postgres prom exporter: ignore queries.yaml on >= bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1059834 [08:43:23] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059834 (owner: 10Ayounsi) [08:44:54] (03CR) 10Ayounsi: "Before:" [puppet] - 10https://gerrit.wikimedia.org/r/1059834 (owner: 10Ayounsi) [08:49:21] (03CR) 10Ayounsi: [V:03+1] "PCC seems happy." 
[puppet] - 10https://gerrit.wikimedia.org/r/1059834 (owner: 10Ayounsi) [08:58:05] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:59:15] 06SRE, 06Growth-Team, 10observability, 10StructuredDiscussions, 10Wikimedia-Logstash: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10041368 (10fgiunchedi) @Urbanecm_WMF I looked for the exception ID on mwlog and wasn't able to find it there either, were you or... [08:59:28] (03CR) 10Vgutierrez: [C:03+1] Exclude some requests from concurrency tracking [puppet] - 10https://gerrit.wikimedia.org/r/1059126 (https://phabricator.wikimedia.org/T368389) (owner: 10CDanis) [09:00:38] (03CR) 10Ayounsi: [C:03+2] Netbox add libpq-dev package [puppet] - 10https://gerrit.wikimedia.org/r/1059099 (owner: 10Ayounsi) [09:01:24] (03PS4) 10Ayounsi: Netbox add libpq-dev package [puppet] - 10https://gerrit.wikimedia.org/r/1059099 [09:03:40] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1059099 (owner: 10Ayounsi) [09:05:36] (03PS1) 10Vgutierrez: hiera: exclude wikimedia_trust from url bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1059837 (https://phabricator.wikimedia.org/T317799) [09:06:16] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059837 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [09:09:59] (03CR) 10Ayounsi: [C:03+2] Netbox: enable netbox_more_metrics plugin [puppet] - 10https://gerrit.wikimedia.org/r/1059363 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:10:53] (03PS4) 10Ayounsi: Netbox: enable netbox_more_metrics plugin [puppet] - 10https://gerrit.wikimedia.org/r/1059363 
(https://phabricator.wikimedia.org/T311052) [09:10:54] (03CR) 10David Caro: [C:03+2] "I have pending doing a reimage of those hosts to properly test, but I'll merge this for now (tested right now by manually removing the pac" [puppet] - 10https://gerrit.wikimedia.org/r/1035412 (owner: 10David Caro) [09:11:38] (03CR) 10Ayounsi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1059363 (https://phabricator.wikimedia.org/T311052) (owner: 10Ayounsi) [09:16:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddumps1001.wikimedia.org [09:17:08] (03PS1) 10Jelto: Revert "phabricator: delay pages my 30 minutes to reduce alerting noise" [puppet] - 10https://gerrit.wikimedia.org/r/1059840 (https://phabricator.wikimedia.org/T371418) [09:19:28] 06SRE, 06Growth-Team, 10observability, 10StructuredDiscussions, 10Wikimedia-Logstash: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10041411 (10Urbanecm_WMF) >>! In T371586#10041368, @fgiunchedi wrote: > @Urbanecm_WMF I looked for the exception ID on mwlog and... 
[09:20:06] (03PS1) 10Btullis: Fix error in site.pp for new analytics zookeeper hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059842 (https://phabricator.wikimedia.org/T364429) [09:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:24:13] (03CR) 10Btullis: [C:03+2] Fix error in site.pp for new analytics zookeeper hosts [puppet] - 10https://gerrit.wikimedia.org/r/1059842 (https://phabricator.wikimedia.org/T364429) (owner: 10Btullis) [09:24:44] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1001.wikimedia.org [09:26:56] (03CR) 10Vgutierrez: haproxy: add confd files for ipblock maps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1059457 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [09:27:04] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddumps1002.wikimedia.org [09:27:47] (03CR) 10Vgutierrez: cache: test requestctl rules in haproxy on cp4044 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [09:32:02] (03PS1) 10Ayounsi: Netbox3: disable crons and remove from netbox cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1059846 [09:35:00] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059846 (owner: 10Ayounsi) [09:35:18] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1002.wikimedia.org [09:35:30] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-conf1004.eqiad.wmnet with OS bookworm [09:35:39] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - 
https://phabricator.wikimedia.org/T364429#10041467 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm [09:35:41] FIRING: JobUnavailable: Reduced availability for job probes/swagger in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:35:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:36:01] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:38:24] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:38:36] (03PS2) 10Ayounsi: Netbox3: disable crons and remove from netbox cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1059846 [09:38:42] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:39:00] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:39:06] (03PS3) 10Ayounsi: Netbox3: disable crons and remove from netbox cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1059846 [09:39:32] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059846 (owner: 10Ayounsi) [09:40:41] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:40:52] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. 
[09:44:00] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:44:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:45:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://en.wikipedia.org/api/rest_v1 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [09:48:00] !log klausman@deploy1003 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [09:48:27] !log klausman@deploy1003 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [09:52:51] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:54:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T1000) [10:03:12] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10041581 (10elukey) I managed to repro the issue on a local docker container, and I can say that it is definitely the code of... [10:04:37] (03PS1) 10Slyngshede: 2FA: Implement recovery codes. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1059850 [10:05:21] (03CR) 10Giuseppe Lavagetto: haproxy: add confd files for ipblock maps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1059457 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [10:06:10] 06SRE, 06Growth-Team, 10observability, 10StructuredDiscussions, 10Wikimedia-Logstash: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10041601 (10fgiunchedi) >>! In T371586#10041411, @Urbanecm_WMF wrote: >>>! In T371586#10041368, @fgiunchedi wrote: >> @Urbanecm_W... [10:07:00] (03PS3) 10Giuseppe Lavagetto: haproxy: add confd files for ipblock maps [puppet] - 10https://gerrit.wikimedia.org/r/1059457 (https://phabricator.wikimedia.org/T370745) [10:07:00] (03PS2) 10Giuseppe Lavagetto: cache: test requestctl rules in haproxy on cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) [10:07:02] (03CR) 10Giuseppe Lavagetto: cache: test requestctl rules in haproxy on cp4044 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059459 (https://phabricator.wikimedia.org/T370745) (owner: 10Giuseppe Lavagetto) [10:11:22] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059851 [10:12:02] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059852 [10:15:32] (03CR) 10Elukey: "In profile::netbox I see this:" [puppet] - 10https://gerrit.wikimedia.org/r/1059846 (owner: 10Ayounsi) [10:16:17] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059852 (owner: 10PipelineBot) [10:16:21] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059851 (owner: 10PipelineBot) [10:16:25] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1057918 (owner: 10PipelineBot) [10:16:29] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1055961 (owner: 10PipelineBot) [10:16:32] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1054867 (owner: 10PipelineBot) [10:16:37] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051716 (owner: 10PipelineBot) [10:16:42] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1051129 (owner: 10PipelineBot) [10:16:46] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1049506 (owner: 10PipelineBot) [10:19:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T367856)', diff saved to https://phabricator.wikimedia.org/P67218 and previous config saved to /var/cache/conftool/dbconfig/20240805-101930-marostegui.json [10:19:33] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [10:19:47] (03PS2) 10Slyngshede: 2FA: Implement recovery codes. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1059850 [10:22:23] !log mfossati@deploy1003 Started deploy [airflow-dags/platform_eng@537b288]: (no justification provided) [10:22:59] !log mfossati@deploy1003 Finished deploy [airflow-dags/platform_eng@537b288]: (no justification provided) (duration: 00m 36s) [10:24:51] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission for hosts clouddb1021.eqiad.wmnet [10:27:45] (03PS4) 10Ayounsi: Netbox3: disable crons and remove from netbox cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1059846 [10:28:09] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1059846 (owner: 10Ayounsi) [10:30:50] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet [10:31:08] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [10:31:09] (03CR) 10Ayounsi: "Good point ! much better. Done." [puppet] - 10https://gerrit.wikimedia.org/r/1059846 (owner: 10Ayounsi) [10:34:02] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-master1002.eqiad.wmnet [10:34:25] (03CR) 10Elukey: [C:03+1] Netbox3: disable crons and remove from netbox cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1059846 (owner: 10Ayounsi) [10:34:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P67219 and previous config saved to /var/cache/conftool/dbconfig/20240805-103437-marostegui.json [10:35:33] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [10:36:39] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: clouddb1021.eqiad.wmnet decommissioned, removing all IPs except 
the asset tag one - btullis@cumin1002" [10:36:39] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:36:39] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts clouddb1021.eqiad.wmnet [10:37:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet [10:39:25] (03PS2) 10Btullis: [WIP] Remove references to clouddb1021 once the host has been decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1048390 (https://phabricator.wikimedia.org/T365453) [10:40:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-master1002.eqiad.wmnet [10:41:58] (03PS3) 10Btullis: Remove references to clouddb1021 once the host has been decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1048390 (https://phabricator.wikimedia.org/T365453) [10:43:37] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-master1001.eqiad.wmnet [10:49:06] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-conf1004.eqiad.wmnet with OS bookworm [10:49:15] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10041698 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm executed with errors: - an-... 
[10:49:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P67220 and previous config saved to /var/cache/conftool/dbconfig/20240805-104943-marostegui.json [10:49:46] (03PS4) 10Slyngshede: Implement 2FA support [software/bitu] - 10https://gerrit.wikimedia.org/r/1057862 [10:49:50] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-master1001.eqiad.wmnet [10:50:14] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-conf1004.eqiad.wmnet with OS bookworm [10:50:22] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10041699 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm [10:53:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1014.eqiad.wmnet [10:57:44] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T371796 (10Ifeatu_Nnaobi_WMDE) 03NEW [10:59:08] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet [11:00:01] 06SRE, 06Growth-Team, 10observability, 10StructuredDiscussions, 10Wikimedia-Logstash: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10041713 (10DMburugu) @Urbanecm_WMF Can we tag editing on this so that they are aware as well? We can facilitate a fix and it wou... 
[11:00:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1014.eqiad.wmnet [11:00:15] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1015.eqiad.wmnet [11:00:44] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-conf1004.eqiad.wmnet with reason: host reimage [11:01:45] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T371796#10041729 (10WMDECyn) Request approved [11:03:09] (03CR) 10Ayounsi: [C:03+2] Netbox3: disable crons and remove from netbox cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1059846 (owner: 10Ayounsi) [11:03:28] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-conf1004.eqiad.wmnet with reason: host reimage [11:04:05] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-client1002.eqiad.wmnet [11:04:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T367856)', diff saved to https://phabricator.wikimedia.org/P67221 and previous config saved to /var/cache/conftool/dbconfig/20240805-110450-marostegui.json [11:04:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance [11:04:54] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [11:05:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1207.eqiad.wmnet with reason: Maintenance [11:05:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T367856)', diff saved to https://phabricator.wikimedia.org/P67222 and previous config saved to /var/cache/conftool/dbconfig/20240805-110512-marostegui.json [11:05:29] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet [11:05:43] 06SRE, 
06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10041744 (10elukey) After a brainbounce with Joe on the SRE IRC channel, we noticed that the environment variables when runnin... [11:06:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1015.eqiad.wmnet [11:06:17] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host snapshot1016.eqiad.wmnet [11:09:11] (03PS1) 10Btullis: Remove remaining references to cloudb1021 [puppet] - 10https://gerrit.wikimedia.org/r/1059854 (https://phabricator.wikimedia.org/T368518) [11:10:44] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1002.eqiad.wmnet [11:11:51] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-coord1001.eqiad.wmnet [11:12:15] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary and A:netbox-all [11:12:40] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1016.eqiad.wmnet [11:16:52] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1004.eqiad.wmnet [11:17:45] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-conf1004.eqiad.wmnet with OS bookworm [11:17:52] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10041750 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-conf1004.eqiad.wmnet with OS bookworm completed: - an-conf1004 (*... 
[11:18:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary and A:netbox-all [11:20:27] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-coord1001.eqiad.wmnet [11:20:58] 06SRE, 06Editing-team, 06Growth-Team, 10MediaWiki-Debug-Logger, and 3 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10041751 (10Urbanecm_WMF) >>! In T371586#10041601, @fgiunchedi wrote: > Ok thank you, it seems to me that mw didn't log the exception as su... [11:22:46] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1004.eqiad.wmnet [11:24:08] 06SRE, 06Editing-team, 06Growth-Team, 10MediaWiki-Debug-Logger, and 3 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10041755 (10Urbanecm_WMF) >>! In T371586#10041712, @DMburugu wrote: > @Urbanecm_WMF Can we tag editing on this so that they are aware as we... 
[11:27:34] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1005.eqiad.wmnet [11:31:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:50] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1005.eqiad.wmnet [11:34:45] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1007.eqiad.wmnet [11:36:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:42:22] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-conf1005.eqiad.wmnet with OS bookworm [11:42:29] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10041784 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-conf1005.eqiad.wmnet with OS bookworm [11:42:33] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1007.eqiad.wmnet [11:44:53] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T371796#10041787 (10Aklapper) @Ifeatu_Nnaobi_WMDE Hi, why is this task assigned to Fabfur? 
[11:46:13] 06SRE, 10SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10041788 (10Aklapper) [11:53:00] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-conf1005.eqiad.wmnet with reason: host reimage [11:56:38] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-conf1005.eqiad.wmnet with reason: host reimage [12:03:44] (03PS1) 10Jelto: add GeekyWorks to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1059867 (https://phabricator.wikimedia.org/T371418) [12:11:27] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-conf1005.eqiad.wmnet with OS bookworm [12:11:39] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10041850 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-conf1005.eqiad.wmnet with OS bookworm completed: - an-conf1005 (*... [12:24:53] 10ops-codfw, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus200[78] - https://phabricator.wikimedia.org/T370429#10041885 (10fgiunchedi) Sweet, thank you @Papaul and @Jhancock.wm ! 
[12:31:47] (03PS1) 10Filippo Giunchedi: install_server: shrink default / /srv swap Prometheus space [puppet] - 10https://gerrit.wikimedia.org/r/1059879 (https://phabricator.wikimedia.org/T370772) [12:41:28] (03CR) 10Filippo Giunchedi: [C:03+2] install_server: shrink default / /srv swap Prometheus space [puppet] - 10https://gerrit.wikimedia.org/r/1059879 (https://phabricator.wikimedia.org/T370772) (owner: 10Filippo Giunchedi) [12:52:01] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-web1001.eqiad.wmnet [12:52:49] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-conf1006.eqiad.wmnet with OS bookworm [12:52:56] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10041984 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-conf1006.eqiad.wmnet with OS bookworm [12:53:27] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-tool1011.eqiad.wmnet [12:55:00] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-tool1008.eqiad.wmnet [12:57:21] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1011.eqiad.wmnet [12:57:56] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-web1001.eqiad.wmnet [12:58:05] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:58:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1008.eqiad.wmnet [13:00:05] Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. 
It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T1300). [13:00:05] joelyrookewmde: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] helloo [13:00:42] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-tool1007.eqiad.wmnet [13:03:38] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-conf1006.eqiad.wmnet with reason: host reimage [13:04:35] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1007.eqiad.wmnet [13:07:10] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-conf1006.eqiad.wmnet with reason: host reimage [13:09:14] (03PS1) 10Brouberol: cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) [13:09:53] (03CR) 10CI reject: [V:04-1] cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:12:31] (03PS2) 10Brouberol: cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) [13:13:10] (03CR) 10CI reject: [V:04-1] cloudnative-pg: define network policies allowing traffic to and from the k8s API server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) (owner: 10Brouberol) [13:14:54] (03PS3) 10Brouberol: cloudnative-pg: define network policies allowing traffic to and from the k8s API 
server [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059887 (https://phabricator.wikimedia.org/T364797) [13:18:14] (03CR) 10CDanis: [C:03+1] hiera: exclude wikimedia_trust from url bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1059837 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [13:20:14] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-conf1006.eqiad.wmnet with OS bookworm [13:20:21] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10042054 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-conf1006.eqiad.wmnet with OS bookworm completed: - an-conf1006 (*... [13:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:01] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10042064 (10SLyngshede-WMF) [13:24:12] 06SRE, 06Data-Engineering, 06Data-Platform-SRE: Streamline Data Platform access approvals for WMF staff - https://phabricator.wikimedia.org/T370424#10042058 (10SLyngshede-WMF) @Ottomata I'm just removing the SRE-Access-Requests tag to remove this from the Clinic Duty dashboard. [13:30:14] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10042068 (10SLyngshede-WMF) @KFrancis Do we have an NDA for @JoelyRooke-WMDE @JoelyRooke-WMDE without a shell account (S... 
[13:32:23] (03PS1) 10Btullis: Configure clouddb servers with reuse-parts-test [puppet] - 10https://gerrit.wikimedia.org/r/1059888 (https://phabricator.wikimedia.org/T365424) [13:32:49] (03CR) 10FNegri: [C:03+1] Configure clouddb servers with reuse-parts-test [puppet] - 10https://gerrit.wikimedia.org/r/1059888 (https://phabricator.wikimedia.org/T365424) (owner: 10Btullis) [13:35:11] (03CR) 10Btullis: [C:03+2] Configure clouddb servers with reuse-parts-test [puppet] - 10https://gerrit.wikimedia.org/r/1059888 (https://phabricator.wikimedia.org/T365424) (owner: 10Btullis) [13:39:46] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10042083 (10JoelyRooke-WMDE) I believe I have already signed the NDA when I got basic LDAP access (https://phabricator.wiki... [13:39:58] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus2007.codfw.wmnet with OS bookworm [13:44:43] !log fnegri@cumin1002 START - Cookbook sre.hosts.reimage for host clouddb1019.eqiad.wmnet with OS bookworm [13:48:35] 06SRE, 06Editing-team, 06Growth-Team, 10MediaWiki-Debug-Logger, and 3 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10042096 (10VPuffetMichel) [13:54:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:55:30] 06SRE, 06Editing-team, 06Growth-Team, 10MediaWiki-Debug-Logger, and 3 others: Flow internal error on frwiki not in logstash - https://phabricator.wikimedia.org/T371586#10042110 (10VPuffetMichel) Good to know. We are about to start on a spike to Investigate Flow automatic migration approaches with T371738.... 
[13:57:44] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2236.mgmt.codfw.wmnet with reboot policy GRACEFUL [13:59:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 05 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059374 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [13:59:38] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddb1019.eqiad.wmnet with reason: host reimage [14:01:04] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2236.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:01:38] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2237.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:02:26] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddb1019.eqiad.wmnet with reason: host reimage [14:02:49] 06SRE, 10SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10042138 (10Ifeatu_Nnaobi_WMDE) >>! In T371796#10041787, @Aklapper wrote: > @Ifeatu_Nnaobi_WMDE Hi, why is this task assigned to Fabfur? Sorry, not sure who to as... 
[14:04:02] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2237.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:10:41] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:55] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1003.eqiad.wmnet [14:14:20] (03CR) 10Brouberol: [C:03+1] Remove remaining references to cloudb1021 [puppet] - 10https://gerrit.wikimedia.org/r/1059854 (https://phabricator.wikimedia.org/T368518) (owner: 10Btullis) [14:15:10] (03CR) 10Brouberol: [C:03+1] "LGTM but I'd say don't merge before getting a +1 from data-persistence" [puppet] - 10https://gerrit.wikimedia.org/r/1048390 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [14:15:41] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:49] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1003.eqiad.wmnet [14:18:59] (03CR) 10CDanis: [C:03+2] jaeger: enable archive support in query and ui [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059390 (https://phabricator.wikimedia.org/T371390) (owner: 10CDanis) [14:20:05] (03Merged) 10jenkins-bot: jaeger: enable archive support in query and ui [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059390 (https://phabricator.wikimedia.org/T371390) (owner: 10CDanis) [14:20:43] !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host prometheus2007.codfw.wmnet with OS bookworm [14:22:47] !log 
brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:23:04] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:23:04] (03PS1) 10Btullis: Update the beta cluster scap targets for dumps [dumps/scap] - 10https://gerrit.wikimedia.org/r/1059891 (https://phabricator.wikimedia.org/T370465) [14:25:20] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:25:34] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:26:19] (03PS1) 10CDanis: bump jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059892 [14:26:30] (03CR) 10CDanis: [C:03+2] bump jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059892 (owner: 10CDanis) [14:27:32] (03PS1) 10Btullis: Update the mediawiki-installation dsh group with new beta snapshot host [puppet] - 10https://gerrit.wikimedia.org/r/1059893 (https://phabricator.wikimedia.org/T370465) [14:28:03] (03Merged) 10jenkins-bot: bump jaeger chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059892 (owner: 10CDanis) [14:30:51] (03PS2) 10Andrew Bogott: wmf_sink: rip out the proxy-cleanup code [puppet] - 10https://gerrit.wikimedia.org/r/1059409 (https://phabricator.wikimedia.org/T371707) [14:35:16] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dumpsdata1006.eqiad.wmnet [14:35:25] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2238.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:35:38] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:36:09] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:39:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:00] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1006.eqiad.wmnet [14:43:13] 06SRE, 10SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10042282 (10Aklapper) a:05Fabfur→03None Noone, in general. :) It's up to people what they plan to work on. [14:43:52] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2238.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:49:23] FIRING: [6x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:41] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2240.mgmt.codfw.wmnet with reboot policy GRACEFUL [14:50:40] FIRING: [7x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:52:14] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:52:41] !log cdanis@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:54:23] FIRING: [8x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:55:21] (03CR) 10Btullis: [C:03+2] Remove remaining references to cloudb1021 [puppet] - 10https://gerrit.wikimedia.org/r/1059854 
(https://phabricator.wikimedia.org/T368518) (owner: 10Btullis) [14:55:34] (03CR) 10Btullis: [C:03+2] Remove references to clouddb1021 once the host has been decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/1048390 (https://phabricator.wikimedia.org/T365453) (owner: 10Btullis) [14:55:59] (03PS1) 10CDanis: jaeger: actually enable archive storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059897 (https://phabricator.wikimedia.org/T371390) [14:56:45] (03CR) 10CDanis: [C:03+2] jaeger: actually enable archive storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059897 (https://phabricator.wikimedia.org/T371390) (owner: 10CDanis) [14:57:10] (03CR) 10Kamila Součková: [C:03+1] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059394 (https://phabricator.wikimedia.org/T369048) (owner: 10Hnowlan) [14:57:39] (03PS1) 10DCausse: cirrus-streaming-updater: bump to v20240805142550-80a0595 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059898 [14:57:43] (03Merged) 10jenkins-bot: jaeger: actually enable archive storage [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059897 (https://phabricator.wikimedia.org/T371390) (owner: 10CDanis) [14:58:24] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.07.29 - 2024.08.16), 13Patch-For-Review: decommission clouddb1021 - https://phabricator.wikimedia.org/T368518#10042319 (10BTullis) a:05BTullis→03None [14:59:23] FIRING: [8x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:59:23] FIRING: [3x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:11] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2240.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:03:01] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host db2239.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:03:54] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10042346 (10Papaul) switch configuration removed [15:04:23] RESOLVED: [8x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:04:49] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10042351 (10Papaul) switch configuration removed [15:07:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2239.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:14:04] (03PS1) 10Elukey: puppet: unset GIT_INDEX_FILE env var in post-commit hooks [puppet] - 10https://gerrit.wikimedia.org/r/1059899 (https://phabricator.wikimedia.org/T368023) [15:14:23] FIRING: [10x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:15:20] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s6 [15:15:25] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 [15:15:40] FIRING: [10x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:12] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddb1019.eqiad.wmnet with OS bookworm [15:17:13] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3540/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059899 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [15:17:54] (03CR) 10David Caro: [C:03+2] kubeadm: add helm-sudo as pair of kubectl-sudo [puppet] - 10https://gerrit.wikimedia.org/r/1055885 (owner: 10David Caro) [15:19:23] RESOLVED: [6x] SystemdUnitFailed: netbox_ganeti_codfw02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:20:40] FIRING: [5x] SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:21:46] (03PS1) 10Ottomata: eventbus: enable instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059900 (https://phabricator.wikimedia.org/T363587) [15:22:14] (03PS2) 10Ottomata: eventbus: enable instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059900 (https://phabricator.wikimedia.org/T363587) [15:22:26] !log elukey@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:22:33] !log elukey@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:24:23] RESOLVED: [5x] 
SystemdUnitFailed: netbox_ganeti_codfw_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:55] !log elukey@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:27:01] !log elukey@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:29:02] (03PS1) 10FNegri: Revert "Configure clouddb servers with reuse-parts-test" [puppet] - 10https://gerrit.wikimedia.org/r/1059902 [15:29:03] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus2007.codfw.wmnet with OS bookworm [15:30:05] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T1530) [15:32:48] (03CR) 10Ayounsi: [C:03+2] netbox.netbox-extra: trigger syncdatasource [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:33:06] (03CR) 10Elukey: [C:03+2] netbox.netbox-extra: trigger syncdatasource [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:35:23] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10042518 (10Jhancock.wm) a:03Papaul [15:35:47] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10042521 (10Jhancock.wm) a:03Papaul [15:36:41] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371629#10042523 (10Jhancock.wm) a:03Papaul [15:36:41] (03Merged) 10jenkins-bot: netbox.netbox-extra: trigger syncdatasource [cookbooks] - 
10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:36:55] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371629#10042537 (10Jhancock.wm) @Papaul disks are removed, servers taken out, and moved to storage. offline script ran. You're free to do your extra steps for frack server dec... [15:37:26] (03PS1) 10Filippo Giunchedi: sre.hosts.reimage: skip asking for puppet version past bullseye [cookbooks] - 10https://gerrit.wikimedia.org/r/1059903 [15:38:41] !log elukey@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:39:09] !log elukey@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:39:44] !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host prometheus2007.codfw.wmnet with OS bookworm [15:40:21] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus2007.codfw.wmnet with OS bookworm [15:41:27] !log filippo@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host prometheus2007.codfw.wmnet with OS bookworm [15:42:00] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus2007.codfw.wmnet with OS bookworm [15:43:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10042567 (10VRiley-WMF) Thanks for the information, I am looking into this [15:45:48] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10042587 (10elukey) Updates: * The Netbox custom script for network provisioning is now asking for a mac address (f... 
[15:49:49] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10042616 (10Papaul) 05Resolved→03Open We cannot ssh into the /25 frack mgmt network. on the /27 we use to do `` ssh -L 8000:10.195.0.1... [15:52:31] (03CR) 10Giuseppe Lavagetto: [C:03+1] "Great find!" [puppet] - 10https://gerrit.wikimedia.org/r/1059899 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [15:53:24] !log filippo@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus2008.codfw.wmnet with OS bookworm [15:53:47] (03CR) 10Btullis: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1059902 (owner: 10FNegri) [15:53:52] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371631#10042668 (10Jhancock.wm) 05Open→03Resolved a:05Papaul→03Jhancock.wm [15:54:08] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission payments2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371630#10042671 (10Jhancock.wm) 05Open→03Resolved a:05Papaul→03Jhancock.wm [15:54:59] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware: decommission frdb2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T371629#10042675 (10Jhancock.wm) 05Open→03Resolved a:05Papaul→03Jhancock.wm [16:00:02] (03CR) 10Elukey: "LGTM! I left a comment to add a log, other than that it seems good to go!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1059903 (owner: 10Filippo Giunchedi) [16:03:16] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: db1238 bus critical errors - https://phabricator.wikimedia.org/T371342#10042745 (10VRiley-WMF) @Marostegui Is there a preferred time for us to maybe offline this device? I would like try updating some of the firmware. 
[16:03:44] (03CR) 10Ebernhardson: [C:03+1] cirrus-streaming-updater: bump to v20240805142550-80a0595 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059898 (owner: 10DCausse) [16:04:00] (03CR) 10Vgutierrez: [C:03+2] hiera: exclude wikimedia_trust from url bwlim [puppet] - 10https://gerrit.wikimedia.org/r/1059837 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [16:05:25] (03PS1) 10Ryan Kemper: wdqs: add graph split type to blackbox probe alert [puppet] - 10https://gerrit.wikimedia.org/r/1059909 (https://phabricator.wikimedia.org/T364366) [16:05:47] (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: bump to v20240805142550-80a0595 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059898 (owner: 10DCausse) [16:07:02] (03CR) 10CDanis: [C:03+1] puppet: unset GIT_INDEX_FILE env var in post-commit hooks [puppet] - 10https://gerrit.wikimedia.org/r/1059899 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [16:07:04] (03Merged) 10jenkins-bot: cirrus-streaming-updater: bump to v20240805142550-80a0595 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1059898 (owner: 10DCausse) [16:07:40] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10042756 (10Dzahn) >>! In T371584#10042067, @SLyngshede-WMF wrote: > @KFrancis Do we have an NDA for @JoelyRooke-WMDE I c... 
[16:08:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.07.29 - 2024.08.16): decommission clouddb1021 - https://phabricator.wikimedia.org/T368518#10042774 (10VRiley-WMF) a:03VRiley-WMF [16:10:27] !log dcausse@deploy1003 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [16:11:12] !log dcausse@deploy1003 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:11:53] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [16:12:03] !log filippo@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus2008.codfw.wmnet with reason: host reimage [16:15:09] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2007.codfw.wmnet with reason: host reimage [16:18:29] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus2008.codfw.wmnet with reason: host reimage [16:19:38] !log dcausse@deploy1003 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [16:19:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.07.29 - 2024.08.16): decommission clouddb1021 - https://phabricator.wikimedia.org/T368518#10042815 (10VRiley-WMF) [16:20:02] !log dcausse@deploy1003 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:20:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.07.29 - 2024.08.16): decommission clouddb1021 - https://phabricator.wikimedia.org/T368518#10042816 (10VRiley-WMF) 05Open→03Resolved This has been decommissioned [16:23:55] (03PS2) 10Ryan Kemper: wdqs: add graph split type to blackbox probe alert [puppet] - 10https://gerrit.wikimedia.org/r/1059909 (https://phabricator.wikimedia.org/T364366) [16:24:00] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1059909 (https://phabricator.wikimedia.org/T364366) (owner: 10Ryan Kemper) [16:27:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Request additional mgmt IP range for frack servers - https://phabricator.wikimedia.org/T370164#10042843 (10ayounsi) a:05cmooney→03Dwisehaupt @Dwisehaupt could you send a patch to add `10.195.1.0/25`to subnet-administration-codfw in... [16:33:06] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus2007.codfw.wmnet with OS bookworm [16:36:16] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus2008.codfw.wmnet with OS bookworm [16:36:57] (03CR) 10Ayounsi: [V:03+2 C:03+2] Loopback filter: allow ntp to/from private ranges (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1047998 (https://phabricator.wikimedia.org/T366360) (owner: 10Ayounsi) [16:45:30] (03CR) 10Dzahn: [C:03+2] Add bdr to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1059471 (https://phabricator.wikimedia.org/T371757) (owner: 10Gerrit maintenance bot) [16:52:08] !log DNS - added new project language 'bdr' - West Coast Bajau - https://en.wikipedia.org/wiki/Sama%E2%80%93Bajaw_languages - T371757 [16:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:11] (03CR) 10FNegri: [C:03+2] Revert "Configure clouddb servers with reuse-parts-test" [puppet] - 10https://gerrit.wikimedia.org/r/1059902 (owner: 10FNegri) [16:52:13] !log dcausse@deploy1003 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:52:15] T371757: Create Wikipedia West Coast Bajau - https://phabricator.wikimedia.org/T371757 [16:52:35] !log dcausse@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:58:05] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T1700) [17:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T1700). [17:01:52] (03PS1) 10David Caro: wmcs: move gitlab tokens to a custom wmcs.yaml config [puppet] - 10https://gerrit.wikimedia.org/r/1059916 [17:02:25] (03CR) 10Dzahn: [C:03+1] add GeekyWorks to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1059867 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [17:04:52] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1023.eqiad.wmnet with OS bullseye [17:07:54] (03CR) 10David Caro: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3541/co" [puppet] - 10https://gerrit.wikimedia.org/r/1059916 (owner: 10David Caro) [17:08:23] (03CR) 10David Caro: [V:03+1 C:03+2] "PCC looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1059916 (owner: 10David Caro) [17:12:39] (03CR) 10Dzahn: [C:03+2] add GeekyWorks to external_clouds_vendors_nets [puppet] - 10https://gerrit.wikimedia.org/r/1059867 (https://phabricator.wikimedia.org/T371418) (owner: 10Jelto) [17:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:25:47] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1023.eqiad.wmnet with reason: host 
reimage [17:28:27] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1023.eqiad.wmnet with reason: host reimage [18:03:07] (03CR) 10Dzahn: gerrit: enable nft throttling on role level, but just log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [18:03:48] (03CR) 10Dzahn: [C:03+2] gerrit: enable nft throttling on role level, but just log [puppet] - 10https://gerrit.wikimedia.org/r/1059416 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [18:07:17] (03CR) 10Andrew Bogott: [C:03+2] git-sync-upstream: execute the entire script as gitpuppet [puppet] - 10https://gerrit.wikimedia.org/r/1058675 (https://phabricator.wikimedia.org/T364492) (owner: 10JHathaway) [18:10:04] (03CR) 10Dzahn: gerrit: set nft throttling policy to drop, only on replica host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [18:12:25] (03PS9) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:17:33] (03PS3) 10Dzahn: gerrit: set nft throttling policy to drop, only on replica host [puppet] - 10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) [18:18:28] (03PS4) 10Dzahn: gerrit: set nft throttling policy to drop, only on replica host [puppet] - 10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) [18:19:15] (03PS2) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [18:21:27] (03PS10) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:21:35] (03CR) 10Dzahn: [V:04-1] 
"https://puppet-compiler.wmflabs.org/output/1059418/3543/miscweb1003.eqiad.wmnet/change.miscweb1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [18:24:13] (03PS3) 10Dzahn: miscweb: switch firewall provider to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1059418 (https://phabricator.wikimedia.org/T370677) [18:24:34] (03CR) 10Andrew Bogott: [C:03+2] "Seems to be working so far :) If a user runs this script as root it will ruin everything, right? Could we put a check in to error out if" [puppet] - 10https://gerrit.wikimedia.org/r/1058675 (https://phabricator.wikimedia.org/T364492) (owner: 10JHathaway) [18:26:27] joucebot nowandnext [18:26:30] jouncebot nowandnext [18:26:30] No deployments scheduled for the next 1 hour(s) and 33 minute(s) [18:26:30] In 1 hour(s) and 33 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T2000) [18:26:55] I'm going to do a couple of scap sync-world tests. [18:27:43] !log dancy@deploy1003 Started scap sync-world: testing updates to repos/releng/release/make-container-image [18:28:29] Completed. [18:30:32] (03PS11) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:34:01] 06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#10043282 (10Dzahn) @Urbanecm No prob... 
[18:34:39] (03PS4) 10Andrew Bogott: git-sync-upstream: rip out uid juggling [puppet] - 10https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492) [18:35:02] (03PS5) 10Andrew Bogott: git-sync-upstream: remove gitpuppet user from networktests [puppet] - 10https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492) [18:35:35] (03PS12) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:37:02] (03CR) 10Andrew Bogott: [C:03+2] git-sync-upstream: remove gitpuppet user from networktests [puppet] - 10https://gerrit.wikimedia.org/r/1055502 (https://phabricator.wikimedia.org/T364492) (owner: 10Andrew Bogott) [18:39:03] (03PS13) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [18:46:51] (03PS1) 10Ryan Kemper: wdqs: make wdqs-all include graph splits [puppet] - 10https://gerrit.wikimedia.org/r/1059931 (https://phabricator.wikimedia.org/T364077) [18:49:16] (03CR) 10Ryan Kemper: [V:03+2 C:03+2] wdqs: make wdqs-all include graph splits [puppet] - 10https://gerrit.wikimedia.org/r/1059931 (https://phabricator.wikimedia.org/T364077) (owner: 10Ryan Kemper) [18:52:14] !log dancy@deploy1003 Installing scap version "4.96.0" for 211 hosts [18:52:20] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1021.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ using stat1009.eqiad.wmnet) [18:52:57] !log dancy@deploy1003 Installation of scap version "4.96.0" completed for 211 hosts [18:53:19] !log dancy@deploy1003 Started scap sync-world: testing scap 4.96.0 [18:56:02] (03CR) 10Dzahn: [C:03+2] gerrit: set nft throttling policy to drop, only on replica host [puppet] - 
10https://gerrit.wikimedia.org/r/1059417 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [18:56:31] !log dancy@deploy1003 sync-world aborted: testing scap 4.96.0 (duration: 03m 11s) [18:59:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:02:00] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10043383 (10KFrancis) Hi @seanleong-WMDE, pleases send your email address and full name to kfrancis@wikimedia.org and I will work on getting the agreement to you. Thanks! [19:05:39] hello! FYI i'm going to deploy a config change that will enable prometheus statslib instrumentation of eventbus ext on all wikis: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1059900 [19:05:51] 👍🏾 [19:05:53] (03PS14) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:06:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059900 (https://phabricator.wikimedia.org/T363587) (owner: 10Ottomata) [19:06:59] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10043398 (10KFrancis) Hi all, I am also confirming we have an NDA on file for @JoelyRooke-WMDE. Thanks! 
[19:07:34] (03Merged) 10jenkins-bot: eventbus: enable instrumentation on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059900 (https://phabricator.wikimedia.org/T363587) (owner: 10Ottomata) [19:07:46] !log otto@deploy1003 Started scap sync-world: Backport for [[gerrit:1059900|eventbus: enable instrumentation on all wikis (T363587)]] [19:07:49] T363587: [Event Platform] Instrument EventBus with prometheus MW Statslib - https://phabricator.wikimedia.org/T363587 [19:09:53] !log otto@deploy1003 otto: Backport for [[gerrit:1059900|eventbus: enable instrumentation on all wikis (T363587)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:10:12] !log otto@deploy1003 otto: Continuing with sync [19:12:03] (03PS15) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:14:55] !log otto@deploy1003 Finished scap: Backport for [[gerrit:1059900|eventbus: enable instrumentation on all wikis (T363587)]] (duration: 07m 08s) [19:14:58] T363587: [Event Platform] Instrument EventBus with prometheus MW Statslib - https://phabricator.wikimedia.org/T363587 [19:16:03] (03PS16) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:24:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 06 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059370 (https://phabricator.wikimedia.org/T370802) (owner: 10Urbanecm) [19:24:51] (03PS17) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:29:12] (03PS18) 10CDobbins: varnish: Add restrictive CSP to 
upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:29:19] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1021.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ using stat1009.eqiad.wmnet) [19:33:35] (03PS19) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:34:47] (03PS20) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:37:10] (03PS21) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:40:43] (03PS22) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:41:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:41:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 2.089s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:41:46] 
06SRE, 06collaboration-services, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists, 13Patch-For-Review: stewards1001 / stewards2001: automatically subscribe stewards to mailman lists (was: Enable API access for Mailman3) - https://phabricator.wikimedia.org/T351202#10043445 (10Urbanecm) Thanks! We'll... [19:46:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-wikifunctions at eqiad: 0% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:46:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-wikifunctions (k8s) 1.947s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-wikifunctions - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:49:51] (03CR) 10CDobbins: "PCC results: https://puppet-compiler.wmflabs.org/output/1059423/3554/cp4045.ulsfo.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) (owner: 10CDobbins) [19:53:45] (03PS23) 10CDobbins: varnish: Add restrictive CSP to upload.wikimedia.org and add tests [puppet] - 10https://gerrit.wikimedia.org/r/1059423 (https://phabricator.wikimedia.org/T117618) [19:57:19] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10043578 (10Jclark-ctr) [19:57:30] 10ops-eqiad, 06SRE, 06Data-Engineering, 06DC-Ops: Q4:rack/setup/install an-conf100[4-6] - https://phabricator.wikimedia.org/T364429#10043580 (10Jclark-ctr) 05Open→03Resolved [19:58:38] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install 
wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10043589 (10Jclark-ctr) a:03Jclark-ctr [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor I ๏ฟฝ Unicode. All rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T2000). [20:00:05] joelyrookewmde: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:39] hello helloo [20:01:19] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: CPU 2 machine check error detected for rdb1014.eqiad.wmnet - https://phabricator.wikimedia.org/T370633#10043597 (10Jclark-ctr) updated firmwares per dells request last week monitoring if any errors return [20:01:29] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T371741#10043590 (10Jclark-ctr) 05Openโ†’03Resolved a:03Jclark-ctr [20:04:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:40] hi joelyrookewmde - do you still someone to deploy? sorry i'm late to the window [20:18:55] Hi! 
No worries at all, that would be great [20:19:05] sure thing - 1 sec [20:19:23] (03PS2) 10Joely Rooke WMDE: Add wikibase client interaction stream to Event Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059374 (https://phabricator.wikimedia.org/T370045) [20:20:21] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059374 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [20:21:00] (03Merged) 10jenkins-bot: Add wikibase client interaction stream to Event Logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1059374 (https://phabricator.wikimedia.org/T370045) (owner: 10Joely Rooke WMDE) [20:21:10] !log cjming@deploy1003 Started scap sync-world: Backport for [[gerrit:1059374|Add wikibase client interaction stream to Event Logging (T370045)]] [20:21:16] T370045: Monitor sidebar wikidata link usage - https://phabricator.wikimedia.org/T370045 [20:23:08] !log cjming@deploy1003 cjming, joelyrookewmde: Backport for [[gerrit:1059374|Add wikibase client interaction stream to Event Logging (T370045)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:23:49] joelyrookewmde: ok to sync? up on test servers [20:24:01] just taking a look [20:27:35] I'm just trying to find a wikidata sitelink to check there is the tracking applied, unfortunately I keep getting sidetracked off mwdebug [20:28:37] Wait sorry I'm still a little new, could you possibly send me a link to the test servers? I think i am looking in the wrong place [20:29:06] it depends on where it's deployed -- do you know which wiki? [20:30:21] i'm looking at the patches on the ticket to see - i'm assuming it went out on some pilot wikis first [20:31:05] yep the other changes of the ticket should already all be deployed on all wikis [20:32:11] I just need any page (e.g. 
wikipedia article) that links to any wikidata item [20:33:49] huh - maybe https://en.wikipedia.org/wiki/Albert_Einstein? do you have the mwdebug extension installed? [20:35:31] oh shoot no I don't [20:35:35] one sec let me look [20:35:51] it's a browser extension that you can turn on/off to test on a mwdebug server [20:36:41] sorry - it's called WikimediaDebug [20:37:34] yeah I read about that and forgot to install it [20:37:51] either way, I can see the changes working now in the beta cluster? [20:38:17] Is that good enough or shall I also check with the mwdebug ? [20:38:34] huh - i think that's fine - fwiw - the change set you're deploying is necessary to get events sent iirc [20:38:59] yep that's the goal haha [20:39:10] ok ideal [20:39:22] next time I will get the extension ready also [20:39:28] alrighty - syncing! [20:39:32] !log cjming@deploy1003 cjming, joelyrookewmde: Continuing with sync [20:39:32] thank youuuuu [20:39:53] np :) [20:44:02] !log cjming@deploy1003 Finished scap: Backport for [[gerrit:1059374|Add wikibase client interaction stream to Event Logging (T370045)]] (duration: 22m 52s) [20:44:05] T370045: Monitor sidebar wikidata link usage - https://phabricator.wikimedia.org/T370045 [20:44:25] joelyrookewmde: should be live! [20:44:48] i think there's possibly a 30 min latency for EventGate to start picking up events [20:46:46] okie dokie I will check it out tomorrow! Have a great afternoon :)) [20:46:52] you too! 
[20:47:18] !log end of UTC late backport window [20:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:18] (PS1) Ahmon Dancy: Add new image building command for mwbuilder sudo [puppet] - https://gerrit.wikimedia.org/r/1059942 [20:58:05] FIRING: KubernetesCalicoDown: ml-serve2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:58:35] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10043717 (Jclark-ctr) wikikube-worker1270. #. 3590 port. 8 wikikube-worker1271. #. 3179 port. 25 wikikube-worker1272. #. 2647 port. 16 wikikube-work... [20:58:46] (CR) CI reject: [V:-1] Add new image building command for mwbuilder sudo [puppet] - https://gerrit.wikimedia.org/r/1059942 (owner: Ahmon Dancy) [20:59:39] (PS2) Ahmon Dancy: Add new image building command for mwbuilder sudo [puppet] - https://gerrit.wikimedia.org/r/1059942 [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240805T2100). 
[21:00:23] (PS3) Ahmon Dancy: Add new image building command for mwbuilder sudo [puppet] - https://gerrit.wikimedia.org/r/1059942 [21:00:59] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1059942 (owner: Ahmon Dancy) [21:20:19] SRE, LDAP-Access-Requests: Grant Access to ldap/wmde, ldap/nda for Seanleong-WMDE - https://phabricator.wikimedia.org/T371694#10043744 (Dzahn) a:seanleong-WMDE [21:22:46] FIRING: [36x] ProbeDown: Service wdqs1012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:27:18] SRE, SRE-Access-Requests: Requesting access to for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T371796#10043749 (Dzahn) @ifeatu_nnaobi_wmde Could you please send an email to [[ https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) | Katie Francis ]] and tell her you... [21:36:02] (PS1) BCornwall: hieradata: Remove traffic-cache-atstext-buster [puppet] - https://gerrit.wikimedia.org/r/1059943 [21:37:29] (PS1) Dzahn: admin: add Joely Rooke (WMDE) to analytics-privatedata, no shell acccess [puppet] - https://gerrit.wikimedia.org/r/1059944 (https://phabricator.wikimedia.org/T371584) [21:39:26] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Joely Rooke WMDE - https://phabricator.wikimedia.org/T371584#10043764 (Dzahn) User should be converted from "LDAP-only" to "analytics-privatedata-users" (looks like shell access, but... [21:41:59] (CR) Dzahn: [C:-1] "User is currently in the "LDAP_only" section. 
To add them to analytics-privatedata-users they need to be converted to the "shell" section," [puppet] - https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: Fabfur) [21:44:53] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10043774 (Dzahn) side comment: Is it technically even possible to have approvals before we know what is being app... [21:47:20] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10043769 (Dzahn) I'm pretty sure this access would be like T371584 for Joely Rooke, so analytics-privatedata-user... [21:51:17] (CR) Dzahn: [C:-1] "I would recommend adding only one user per patch. Rarely will the tickets be ready at the same time and handing over clinic duty to next w" [puppet] - https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: Fabfur) [21:52:19] (PS2) Dzahn: admin: add wmdecyn to analytics-privatedata-users, no shell [puppet] - https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: Fabfur) [21:53:03] (CR) Dzahn: "I had made https://gerrit.wikimedia.org/r/c/operations/puppet/+/1059944 before realizing this patch also existed. 
Then amended here to mak" [puppet] - https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: Fabfur) [21:53:46] (CR) Dzahn: [C:+1] admin: add wmdecyn to analytics-privatedata-users, no shell [puppet] - https://gerrit.wikimedia.org/r/1059371 (https://phabricator.wikimedia.org/T371689) (owner: Fabfur) [21:54:46] (CR) Dzahn: [C:+1] "will need rebase on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1059371 if that is merged first or vice versa" [puppet] - https://gerrit.wikimedia.org/r/1059944 (https://phabricator.wikimedia.org/T371584) (owner: Dzahn) [21:58:05] (CR) Dzahn: "This patch is from 2022 and it's an access request. But there is no access request ticket to go with it. So it won't be seen and processed" [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight) [21:59:47] SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to analytics-privatedata-users for Cynthia Makonyango WMDE - https://phabricator.wikimedia.org/T371689#10043786 (Dzahn) I amended to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1059371 in a way which I think... 
[22:25:36] (CR) Dzahn: "due to my lack of context, a ticket and sudo rules being involved and based on based on git-blame of this file, I would prefer if Giuseppe" [puppet] - https://gerrit.wikimedia.org/r/1059942 (owner: Ahmon Dancy) [22:32:03] (Abandoned) Awight: Invite some of WMDE Tech Wishes team to poke around maps instances [puppet] - https://gerrit.wikimedia.org/r/850160 (https://phabricator.wikimedia.org/T321722) (owner: Awight) [22:59:23] FIRING: [2x] JobUnavailable: Reduced availability for job netbox_django in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:29:54] (PS6) Jdlrobson: Promote dark mode for anons on various wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1058683 (https://phabricator.wikimedia.org/T371070) [23:38:32] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1059951 [23:38:33] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1059951 (owner: TrainBranchBot) [23:51:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:56:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:59:58] (CR) Cwhite: [C:+1] "LGTM" [puppet] - 
https://gerrit.wikimedia.org/r/1059265 (https://phabricator.wikimedia.org/T371492) (owner: Filippo Giunchedi)