[00:02:49] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:04:05] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [00:04:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main2009.codfw.wmnet with OS bullseye [00:04:16] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810570 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye completed:... [00:04:39] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810571 (Papaul) @Jhancock.wm thank you for working on this. Like I mentioned to you this morning the reason kafka-main2009 was failing is because it was con... 
[00:04:53] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810572 (Papaul) [00:05:14] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9810573 (Papaul) Open→Resolved [00:05:25] FIRING: SystemdUnitFailed: logrotate.service on elastic2090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:34] SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T365308 (ecarg) NEW [00:06:45] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:07:49] SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T365308#9810591 (ecarg) Deployment access was approved and resolved very recently: https://phabricator.wikimedia.org/T364414, but from unsuccessful attempts to ssh into Prod... 
[00:08:07] SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T365308#9810593 (ecarg) [00:08:54] SRE, SRE-Access-Requests: Requesting update to SSH key for access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T365308#9810594 (ecarg) [00:17:51] SRE, SRE-Access-Requests: Requesting update to SSH key for access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T365308#9810606 (Peachey88) Open→Invalid Let's just track this in {T364414} [00:18:55] SRE, SRE-Access-Requests: Requesting access to deployment for ecarg/Grace Choi - https://phabricator.wikimedia.org/T364414#9810609 (Peachey88) Resolved→Open >>! In T365308#9810591, @ecarg wrote: > Deployment access was approved and resolved very recently: https://phabricator.wikimedia.org/T364414... [00:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [00:35:12] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878 [00:35:15] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878 [00:35:16] T353878: Service implementation for elastic2087-2109 - https://phabricator.wikimedia.org/T353878 [00:45:12] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [00:45:15] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance [00:50:33] PROBLEM - Check whether ferm is active by checking the default input chain 
on mw1415 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:18:21] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2090.codfw.wmnet with OS bullseye [01:20:33] RECOVERY - Check whether ferm is active by checking the default input chain on mw1415 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:49:34] PROBLEM - snapshot of s6 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s6 at eqiad (db1225) taken on 2024-05-18 01:01:36 is 457 GiB, but the previous one was 555 GiB, a change of -17.7 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:36:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:39:16] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2090.codfw.wmnet with OS bullseye [02:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [03:01:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:03:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance [03:03:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: 
Maintenance [03:04:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2124 (T352010)', diff saved to https://phabricator.wikimedia.org/P62594 and previous config saved to /var/cache/conftool/dbconfig/20240518-030359-ladsgroup.json [03:04:04] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:46:45] FIRING: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:06:45] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:10:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:15:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:17:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 
- https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:19:10] Gerrit seems to be down ? [04:20:32] (Looks like it's back up) [04:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [04:22:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:41:58] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:42:58] (PS2) Pppery: Re-extract i18n to pick up latest changes [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188) [04:43:36] (PS3) Pppery: Re-extract i18n to pick up latest changes [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188) [04:43:55] RESOLVED: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:44:24] (PS4) Pppery: Re-extract i18n to pick up latest changes [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188) [05:05:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling 
after maintenance db2121 (T352010)', diff saved to https://phabricator.wikimedia.org/P62595 and previous config saved to /var/cache/conftool/dbconfig/20240518-050535-ladsgroup.json [05:05:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:17:54] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 1383 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [05:20:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P62596 and previous config saved to /var/cache/conftool/dbconfig/20240518-052043-ladsgroup.json [05:35:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P62597 and previous config saved to /var/cache/conftool/dbconfig/20240518-053550-ladsgroup.json [05:49:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T364299)', diff saved to https://phabricator.wikimedia.org/P62598 and previous config saved to /var/cache/conftool/dbconfig/20240518-054942-marostegui.json [05:49:46] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:51:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T352010)', diff saved to https://phabricator.wikimedia.org/P62599 and previous config saved to /var/cache/conftool/dbconfig/20240518-055100-ladsgroup.json [05:51:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [05:51:06] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:51:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance 
[05:51:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T352010)', diff saved to https://phabricator.wikimedia.org/P62600 and previous config saved to /var/cache/conftool/dbconfig/20240518-055125-ladsgroup.json [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:04:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P62601 and previous config saved to /var/cache/conftool/dbconfig/20240518-060450-marostegui.json [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:12:54] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman/Monitoring https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3 [06:19:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P62602 and previous config saved to /var/cache/conftool/dbconfig/20240518-061958-marostegui.json [06:31:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T352010)', diff saved to https://phabricator.wikimedia.org/P62603 and previous config saved to /var/cache/conftool/dbconfig/20240518-063138-ladsgroup.json [06:31:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 
[06:35:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T364299)', diff saved to https://phabricator.wikimedia.org/P62604 and previous config saved to /var/cache/conftool/dbconfig/20240518-063505-marostegui.json [06:35:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance [06:35:13] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:35:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance [06:35:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2195 (T364299)', diff saved to https://phabricator.wikimedia.org/P62605 and previous config saved to /var/cache/conftool/dbconfig/20240518-063529-marostegui.json [06:46:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P62606 and previous config saved to /var/cache/conftool/dbconfig/20240518-064646-ladsgroup.json [06:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [07:01:45] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:01:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P62607 and previous config saved to /var/cache/conftool/dbconfig/20240518-070155-ladsgroup.json [07:17:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T352010)', diff saved to 
https://phabricator.wikimedia.org/P62608 and previous config saved to /var/cache/conftool/dbconfig/20240518-071703-ladsgroup.json [07:17:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [07:17:09] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [07:17:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [07:17:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P62609 and previous config saved to /var/cache/conftool/dbconfig/20240518-071726-ladsgroup.json [07:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:03:23] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T365310 (phaultfinder) NEW [08:06:45] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:23:30] RECOVERY - Host ml-serve2002 is UP: PING WARNING - Packet loss = 66%, RTA = 30.27 ms [08:24:28] PROBLEM - SSH on ml-serve2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:29:54] PROBLEM - Host 
ml-serve2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:42:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P62610 and previous config saved to /var/cache/conftool/dbconfig/20240518-104222-ladsgroup.json [10:42:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:49:20] (PS1) GergesShamon: [frwiktionary] Create new namespace Convention & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033108 [10:49:58] (CR) CI reject: [V:-1] [frwiktionary] Create new namespace Convention & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033108 (owner: GergesShamon) [10:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [10:52:21] (PS2) GergesShamon: [frwiktionary] Create new namespace Convention & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033108 [10:53:09] (CR) CI reject: [V:-1] [frwiktionary] Create new namespace Convention & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033108 (owner: GergesShamon) [10:57:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P62611 and previous config saved to /var/cache/conftool/dbconfig/20240518-105729-ladsgroup.json [11:01:45] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:11:41] (CR) Peachey88: "Please add a Bug: line linking to the relevant Phabricator task for this site change !log ladsgroup@cumin1002 
dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P62612 and previous config saved to /var/cache/conftool/dbconfig/20240518-111237-ladsgroup.json [11:16:53] (PS1) GergesShamon: [frwiktionary] Create new namespace "Convention" & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033113 (https://phabricator.wikimedia.org/T360989) [11:17:27] (CR) CI reject: [V:-1] [frwiktionary] Create new namespace "Convention" & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033113 (https://phabricator.wikimedia.org/T360989) (owner: GergesShamon) [11:17:51] (Abandoned) GergesShamon: [frwiktionary] Create new namespace Convention & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033108 (owner: GergesShamon) [11:26:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:22] (PS2) GergesShamon: [frwiktionary] Create new namespace "Convention" & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033113 (https://phabricator.wikimedia.org/T360989) [11:27:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:27:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T352010)', diff saved to https://phabricator.wikimedia.org/P62613 and previous config saved to /var/cache/conftool/dbconfig/20240518-112745-ladsgroup.json [11:27:49] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [11:27:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:28:02] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet 
with reason: Maintenance [11:28:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:28:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [11:28:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T352010)', diff saved to https://phabricator.wikimedia.org/P62614 and previous config saved to /var/cache/conftool/dbconfig/20240518-112824-ladsgroup.json [11:28:41] (CR) CI reject: [V:-1] [frwiktionary] Create new namespace "Convention" & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033113 (https://phabricator.wikimedia.org/T360989) (owner: GergesShamon) [11:35:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:36:28] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:06:45] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:12:46] (PS3) GergesShamon: [frwiktionary] Create new namespace "Convention" & associated talk [mediawiki-config] - https://gerrit.wikimedia.org/r/1033113 (https://phabricator.wikimedia.org/T360989) [12:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:55:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:55:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:57:24] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:57:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:33:38] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:38:38] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 28 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:10:40] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:25:38] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 24 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:36:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:01:45] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:05:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T352010)', diff saved to https://phabricator.wikimedia.org/P62615 and previous config saved to 
/var/cache/conftool/dbconfig/20240518-150548-ladsgroup.json [15:05:55] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:20:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P62616 and previous config saved to /var/cache/conftool/dbconfig/20240518-152056-ladsgroup.json [15:36:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P62617 and previous config saved to /var/cache/conftool/dbconfig/20240518-153604-ladsgroup.json [15:43:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T364299)', diff saved to https://phabricator.wikimedia.org/P62618 and previous config saved to /var/cache/conftool/dbconfig/20240518-154343-marostegui.json [15:43:49] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:51:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T352010)', diff saved to https://phabricator.wikimedia.org/P62619 and previous config saved to /var/cache/conftool/dbconfig/20240518-155112-ladsgroup.json [15:51:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:51:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:51:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance [15:51:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T352010)', diff saved to https://phabricator.wikimedia.org/P62620 and previous config saved to /var/cache/conftool/dbconfig/20240518-155136-ladsgroup.json [15:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:58:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P62621 and previous config saved to /var/cache/conftool/dbconfig/20240518-155852-marostegui.json [16:06:45] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:14:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P62622 and previous config saved to /var/cache/conftool/dbconfig/20240518-161400-marostegui.json [16:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:29:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T364299)', diff saved to https://phabricator.wikimedia.org/P62623 and previous config saved to /var/cache/conftool/dbconfig/20240518-162907-marostegui.json [16:29:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [16:29:25] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:29:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [16:32:38] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 42 probes of 801 (alerts on 35) - 
https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:37:38] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 25 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:16:38] !log ryankemper@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2090.codfw.wmnet with OS bullseye [18:19:30] FIRING: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:24:30] RESOLVED: [2x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:33:56] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2090.codfw.wmnet with reason: host reimage [18:36:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2090.codfw.wmnet with reason: host reimage [18:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [18:56:51] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2090.codfw.wmnet with OS bullseye [18:58:59] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_codfw [18:59:05] !log 
ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_codfw [19:01:45] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:17:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T352010)', diff saved to https://phabricator.wikimedia.org/P62624 and previous config saved to /var/cache/conftool/dbconfig/20240518-191732-ladsgroup.json [19:17:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:32:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P62625 and previous config saved to /var/cache/conftool/dbconfig/20240518-193240-ladsgroup.json [19:47:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P62626 and previous config saved to /var/cache/conftool/dbconfig/20240518-194750-ladsgroup.json [19:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:02:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T352010)', diff saved to https://phabricator.wikimedia.org/P62627 and previous config saved to /var/cache/conftool/dbconfig/20240518-200258-ladsgroup.json [20:03:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [20:03:03] T352010: Gradually drop old pagelinks columns - 
https://phabricator.wikimedia.org/T352010 [20:03:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance [20:03:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T352010)', diff saved to https://phabricator.wikimedia.org/P62628 and previous config saved to /var/cache/conftool/dbconfig/20240518-200322-ladsgroup.json [20:06:45] FIRING: [2x] SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:45] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:42:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T352010)', diff saved to https://phabricator.wikimedia.org/P62629 and previous config saved to /var/cache/conftool/dbconfig/20240518-214200-ladsgroup.json [21:42:08] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:57:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P62630 and previous config saved to /var/cache/conftool/dbconfig/20240518-215708-ladsgroup.json [22:12:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P62631 and previous config saved to /var/cache/conftool/dbconfig/20240518-221216-ladsgroup.json [22:22:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T352010)', diff saved to https://phabricator.wikimedia.org/P62632 and previous 
config saved to /var/cache/conftool/dbconfig/20240518-222212-ladsgroup.json [22:22:17] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:27:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T352010)', diff saved to https://phabricator.wikimedia.org/P62633 and previous config saved to /var/cache/conftool/dbconfig/20240518-222725-ladsgroup.json [22:27:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [22:27:30] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:27:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [22:27:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T352010)', diff saved to https://phabricator.wikimedia.org/P62634 and previous config saved to /var/cache/conftool/dbconfig/20240518-222748-ladsgroup.json [22:37:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P62635 and previous config saved to /var/cache/conftool/dbconfig/20240518-223720-ladsgroup.json [22:51:06] FIRING: KubernetesAPINotScrapable: k8s@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:52:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P62636 and previous config saved to /var/cache/conftool/dbconfig/20240518-225228-ladsgroup.json [23:01:46] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:07:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T352010)', diff saved to https://phabricator.wikimedia.org/P62637 and previous config saved to /var/cache/conftool/dbconfig/20240518-230736-ladsgroup.json [23:07:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [23:07:41] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:07:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2193.codfw.wmnet with reason: Maintenance [23:08:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T352010)', diff saved to https://phabricator.wikimedia.org/P62638 and previous config saved to /var/cache/conftool/dbconfig/20240518-230800-ladsgroup.json [23:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033387 [23:38:21] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033387 (owner: TrainBranchBot) [23:51:45] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on serpens:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed