[00:07:41] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1060943 (owner: TrainBranchBot)
[00:09:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:04] (PS2) RLazarus: mediawiki: Build sidecars annotation dynamically [deployment-charts] - https://gerrit.wikimedia.org/r/1060515
[00:29:29] (CR) RLazarus: "Yes! It means the test output's more verbose, because it brings a bunch of other stuff in, but it does include the annotation we're lookin" [deployment-charts] - https://gerrit.wikimedia.org/r/1060515 (owner: RLazarus)
[00:35:20] (PS1) RLazarus: mwscript_cleanup: Handle when job.status.conditions is None [puppet] - https://gerrit.wikimedia.org/r/1060946
[00:35:39] ^ 1060946 addresses the mwscript-cleanup failures
[00:50:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:17:31] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:19:57] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T372042#10053329 (phaultfinder)
[01:22:31] FIRING: [5x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:27:31] FIRING: [5x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:37:31] FIRING: [6x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:42:31] FIRING: [5x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[01:51:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testhost2001.codfw.wmnet with OS bookworm
[01:51:18] ops-codfw, SRE, cloud-services-team, DC-Ops: Test new hardware candidate for cloudbackup replacement - https://phabricator.wikimedia.org/T353746#10053337 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host testhost2001.codfw.wmnet with OS bookworm execut...
[01:57:31] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:02:31] FIRING: [8x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:39:24] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:57:31] FIRING: [6x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:59:24] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:31] FIRING: [7x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:04:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:07:31] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:12:31] FIRING: [6x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:17:31] FIRING: [5x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:22:31] FIRING: [5x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:02:53] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host
[04:03:24] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 30s)
[04:03:34] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host
[04:03:38] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host (duration: 00m 03s)
[04:15:59] (CR) Ryan Kemper: wdqs: store metadata about graph split type (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: Ryan Kemper)
[04:17:10] (PS4) Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077)
[04:22:21] (PS5) Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077)
[04:24:36] (PS6) Ryan Kemper: wdqs: store metadata about graph split type [cookbooks] - https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077)
[04:25:00] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-transfer (T364077, transfer to unpooled host (1022) to test cookbook changes) xfer wikidata from wdqs1012.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling source-only afterwards
[04:25:03] T364077: Adapt the wdqs data-transfer cookbook to operate with federated subgraphs - https://phabricator.wikimedia.org/T364077
[04:26:14] The wdqs1022 alerts from above are likely from downtiming expired. wdqs1022 is a non-pooled host (not in production). Reapplying downtime
[04:26:23] downtime expiring*
[04:27:31] RESOLVED: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:32:24] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) (T364077, transfer to unpooled host (1022) to test cookbook changes) xfer wikidata from wdqs1012.eqiad.wmnet -> wdqs1022.eqiad.wmnet, repooling source-only afterwards
[04:32:31] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:32:36] T364077: Adapt the wdqs data-transfer cookbook to operate with federated subgraphs - https://phabricator.wikimedia.org/T364077
[04:36:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1022:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:40:09] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 15:00:00 on 9 hosts with reason: T364368 non-prod hosts
[04:40:12] T364368: Create separate pybal pools for wdqs graph split (main vs scholarly) - https://phabricator.wikimedia.org/T364368
[04:40:35] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on 9 hosts with reason: T364368 non-prod hosts
[04:42:31] RESOLVED: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:50:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:24:23] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_main on wdqs1021.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20240729/ using stat1009.eqiad.wmnet)
[05:24:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:32:31] FIRING: ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:37:31] RESOLVED: ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240809T0600)
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:19:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:53:56] (CR) DCausse: search: use mul fallback for manually-tuned search profiles (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1060449 (https://phabricator.wikimedia.org/T371401) (owner: DCausse)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240809T0700)
[07:23:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T367856)', diff saved to https://phabricator.wikimedia.org/P67254 and previous config saved to /var/cache/conftool/dbconfig/20240809-072320-marostegui.json
[07:23:24] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[07:25:41] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:38:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P67256 and previous config saved to /var/cache/conftool/dbconfig/20240809-073828-marostegui.json
[07:53:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P67257 and previous config saved to /var/cache/conftool/dbconfig/20240809-075335-marostegui.json
[08:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T367856)', diff saved to https://phabricator.wikimedia.org/P67258 and previous config saved to /var/cache/conftool/dbconfig/20240809-080842-marostegui.json
[08:08:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[08:08:51] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856
[08:08:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance
[08:09:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T367856)', diff saved to https://phabricator.wikimedia.org/P67259 and previous config saved to /var/cache/conftool/dbconfig/20240809-080904-marostegui.json
[08:24:24] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:27:31] FIRING: [3x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:32:31] RESOLVED: ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:50:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:32:31] FIRING: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:37:31] RESOLVED: ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2007:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:47:31] FIRING: [2x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:52:31] FIRING: [3x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:57:31] RESOLVED: [3x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:09:58] (PS14) Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[10:10:23] (CR) Cathal Mooney: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1060458 (owner: Cathal Mooney)
[10:19:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:59] (PS1) Gergő Tisza: Temporarily re-add writeapi userright until AWB stops depending on it [mediawiki-config] - https://gerrit.wikimedia.org/r/1061054 (https://phabricator.wikimedia.org/T372017)
[10:46:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[10:47:08] (CR) DCausse: [C:+1] wdqs: store metadata about graph split type [cookbooks] - https://gerrit.wikimedia.org/r/1053205 (https://phabricator.wikimedia.org/T364077) (owner: Ryan Kemper)
[11:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240809T0700)
[11:00:06] eoghan, jelto, arnoldokoth, and mutante: Time to snap out of that daydream and deploy GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240809T1100).
[11:01:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:12:13] (CR) Jforrester: "As AWB will tell people to update, which will fix this issue, is it worth deploying this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/1061054 (https://phabricator.wikimedia.org/T372017) (owner: Gergő Tisza)
[11:30:26] (PS15) Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850)
[11:41:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:50:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:18] (PS1) David Caro: wmcs-backup: don't fail whit backup has no project [puppet] - https://gerrit.wikimedia.org/r/1061075
[12:28:35] (CR) David Caro: [C:+2] wmcs-backup: don't fail whit backup has no project [puppet] - https://gerrit.wikimedia.org/r/1061075 (owner: David Caro)
[12:32:31] FIRING: ProbeDown: Service wdqs2019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:32:38] (CR) Gergő Tisza: "Yeah probably not. I didn't realize it is fixed already." [mediawiki-config] - https://gerrit.wikimedia.org/r/1061054 (https://phabricator.wikimedia.org/T372017) (owner: Gergő Tisza)
[12:34:40] (Abandoned) Gergő Tisza: Temporarily re-add writeapi userright until AWB stops depending on it [mediawiki-config] - https://gerrit.wikimedia.org/r/1061054 (https://phabricator.wikimedia.org/T372017) (owner: Gergő Tisza)
[12:37:31] RESOLVED: [3x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:45:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:07:31] FIRING: [3x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:12:31] RESOLVED: [5x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:19:02] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host alert1002.mgmt.eqiad.wmnet with reboot policy FORCED
[13:20:39] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[13:23:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt allert1004 - jclark@cumin1002"
[13:23:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt allert1004 - jclark@cumin1002"
[13:23:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:24:04] ops-eqiad, SRE, DC-Ops, observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10054113 (Jclark-ctr)
[13:24:24] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:29:55] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host alert1002
[13:29:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host alert1002
[13:31:42] (CR) Cathal Mooney: Add function to wmf-netbox plugin to provide QoS config data (1 comment) [software/homer/deploy] - https://gerrit.wikimedia.org/r/1049554 (https://phabricator.wikimedia.org/T339850) (owner: Cathal Mooney)
[13:35:51] SRE, SRE Program Management, Wikimedia-Design, Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067#10054165 (Ladsgroup)
[13:45:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:50:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:52:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host alert1002.mgmt.eqiad.wmnet with reboot policy FORCED
[13:52:49] !log jclark@cumin1002 START - Cookbook sre.dns.netbox
[13:55:09] ops-eqiad, SRE, DC-Ops, observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10054224 (Jclark-ctr)
[13:55:50] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt prometheus1007-8 - jclark@cumin1002"
[13:55:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt prometheus1007-8 - jclark@cumin1002"
[13:55:54] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:56:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host prometheus1007.mgmt.eqiad.wmnet with reboot policy FORCED
[13:56:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host prometheus1008.mgmt.eqiad.wmnet with reboot policy FORCED
[13:58:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host alert1002.wikimedia.org with OS bookworm
[13:58:42] ops-eqiad, SRE, DC-Ops, observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10054225 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host alert1002.wikimedia.org with OS bookworm
[14:03:14] ops-eqiad, SRE, Data-Platform, DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10054239 (Jclark-ctr) @BTullis the site.pp roles are incorrect can you update these so they are insetup?
[14:03:36] ops-eqiad, SRE, Data-Platform, DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10054240 (Jclark-ctr)
[14:05:53] ops-eqiad, SRE, DC-Ops, observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10054241 (Jclark-ctr)
[14:07:12] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host prometheus1007.mgmt.eqiad.wmnet with reboot policy FORCED
[14:07:31] FIRING: ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:10:00] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus1007.eqiad.wmnet with OS bookworm
[14:10:09] ops-eqiad, SRE, DC-Ops, observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10054243 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host prometheus1007.eqiad.wmnet with OS bookworm
[14:13:47] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on alert1002.wikimedia.org with reason: host reimage
[14:17:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on alert1002.wikimedia.org with reason: host reimage
[14:17:31] RESOLVED: ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:19:24] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:20:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host prometheus1008.mgmt.eqiad.wmnet with reboot policy FORCED
[14:20:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host prometheus1008.eqiad.wmnet with OS bookworm
[14:20:54] ops-eqiad, SRE, DC-Ops, observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10054248 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host prometheus1008.eqiad.wmnet with OS bookworm
[14:26:52] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus1007.eqiad.wmnet with reason: host reimage
[14:27:31] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:32:31] RESOLVED: [5x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:32:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus1007.eqiad.wmnet with reason: host reimage
[14:33:53] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:34:52] (PS1) Zabe: beta: Introduce encrypted PBKDF2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1061086
[14:35:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:37:27] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus1008.eqiad.wmnet with reason: host reimage
[14:39:24] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus1008.eqiad.wmnet with reason: host reimage
[14:40:13] (CR) Zabe: [C:+2] beta: Introduce encrypted PBKDF2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1061086 (owner: Zabe)
[14:41:18] (Merged) jenkins-bot: beta: Introduce encrypted PBKDF2 [mediawiki-config] - https://gerrit.wikimedia.org/r/1061086 (owner: Zabe)
[14:42:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:50:23] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[14:52:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:52:07] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host alert1002.wikimedia.org with OS bookworm [14:52:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10054353 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host alert1002.wikimedia.org with OS bookworm completed: - alert1002 (**PASS**) -... [14:53:02] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10054357 (10Jclark-ctr) [14:54:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install alert1002 - https://phabricator.wikimedia.org/T370111#10054358 (10Jclark-ctr) 05Stalled→03Resolved a:03Jclark-ctr [14:57:31] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [14:59:07] (03PS1) 10Zabe: Use encrypted PBKDF2 for wrapping B type passwords instead of Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061088 (https://phabricator.wikimedia.org/T112359) [14:59:24] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:51] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:01:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
prometheus1008.eqiad.wmnet with OS bookworm [15:01:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10054375 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host prometheus1008.eqiad.wmnet with OS bookworm completed: - prometheus100... [15:08:24] !log jclark@cumin1002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [15:08:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus1007.eqiad.wmnet with OS bookworm [15:08:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:08:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10054389 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host prometheus1007.eqiad.wmnet with OS bookworm completed: - prometheus100... 
[15:09:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10054390 (10Jclark-ctr) [15:09:52] 10ops-eqiad, 06SRE, 06DC-Ops, 10observability: Q1:rack/setup/install prometheus100[78] - https://phabricator.wikimedia.org/T370426#10054391 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [15:18:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:27:31] FIRING: ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:29:40] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:30:56] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:32:31] RESOLVED: ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:40:56] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - 
https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [15:48:24] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T372160 (10phaultfinder) 03NEW [16:02:31] FIRING: ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:07:31] RESOLVED: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:59] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1260.eqiad.wmnet with OS bullseye [16:27:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054568 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1260.eqiad.wmnet with OS bull... 
[16:28:07] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1266.eqiad.wmnet with OS bullseye [16:28:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054579 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1266.eqiad.wmnet with OS bull... [16:30:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:43:12] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [16:45:11] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1266.eqiad.wmnet with reason: host reimage [16:46:16] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt an-presto1016-20 - jclark@cumin1002" [16:46:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt an-presto1016-20 - jclark@cumin1002" [16:46:20] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:48:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1266.eqiad.wmnet with reason: host reimage [16:48:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-presto1016.mgmt.eqiad.wmnet with reboot policy FORCED [16:48:44] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10054605 (10phaultfinder) [16:48:50] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-presto1017.mgmt.eqiad.wmnet with reboot policy FORCED 
[16:49:57] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-presto1018.mgmt.eqiad.wmnet with reboot policy FORCED [16:50:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-presto1019.mgmt.eqiad.wmnet with reboot policy FORCED [16:50:14] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-presto1020.mgmt.eqiad.wmnet with reboot policy FORCED [16:50:20] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-presto1018.mgmt.eqiad.wmnet with reboot policy FORCED [16:50:25] FIRING: [3x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:51:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:51:18] FIRING: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from IN) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [16:51:28] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-presto1018.mgmt.eqiad.wmnet with reboot policy FORCED [16:55:08] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10054606 (10Jclark-ctr) [16:56:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - 
https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [16:56:18] RESOLVED: NELByCountryHigh: Elevated Network Error Logging events (tcp.timed_out from IN) - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELByCountryHigh [16:56:18] 10ops-eqiad, 06SRE, 06Data-Platform, 06DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10054607 (10Jclark-ctr) a:03BTullis [17:00:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1266.eqiad.wmnet with OS bullseye [17:00:54] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1266.eqiad.wmnet with OS bullseye... 
[17:04:17] FIRING: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:06:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1017.mgmt.eqiad.wmnet with reboot policy FORCED [17:09:17] RESOLVED: NELHigh: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [17:12:57] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [17:13:41] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1260 [17:14:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1260 [17:15:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1020.mgmt.eqiad.wmnet with reboot policy FORCED [17:15:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:20:32] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1018.mgmt.eqiad.wmnet with reboot policy FORCED [17:20:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1019.mgmt.eqiad.wmnet with reboot policy FORCED [17:21:24] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-presto1016.mgmt.eqiad.wmnet with reboot policy FORCED [17:22:31] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:25:16] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T371949#10054640 (10Jclark-ctr) 05Open→03Resolved Reseated power supply [17:27:31] FIRING: [3x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:28:35] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1260.eqiad.wmnet with OS bullseye [17:28:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054653 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1260.eqiad.wmnet with OS bull... [17:32:31] RESOLVED: [3x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:51:05] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1260.eqiad.wmnet with reason: host reimage [17:54:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1260.eqiad.wmnet with reason: host reimage [18:10:07] !log bking@wdqs-codfw-public mitigate codfw wdqs abuse via nginx hotfix T372074 [18:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:25] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:14:42] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T372168 
(10phaultfinder) 03NEW [18:19:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:02:02] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [19:02:03] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1260.eqiad.wmnet with OS bullseye [19:02:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054724 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1260.eqiad.wmnet with OS bullseye... [19:03:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054727 (10Jclark-ctr) [19:18:08] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [19:21:11] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker - jclark@cumin1002" [19:21:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt wikikube-worker - jclark@cumin1002" [19:21:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:26:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1297.mgmt.eqiad.wmnet with reboot policy FORCED [19:26:49] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1298.mgmt.eqiad.wmnet with 
reboot policy FORCED [19:27:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1300.mgmt.eqiad.wmnet with reboot policy FORCED [19:27:13] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1301.mgmt.eqiad.wmnet with reboot policy FORCED [19:27:27] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1299.mgmt.eqiad.wmnet with reboot policy FORCED [19:27:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1304.mgmt.eqiad.wmnet with reboot policy FORCED [19:27:51] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1302.mgmt.eqiad.wmnet with reboot policy FORCED [19:27:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1303.mgmt.eqiad.wmnet with reboot policy FORCED [19:29:19] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [19:30:42] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [19:32:51] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [19:42:31] FIRING: [2x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:47:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1297.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1299.mgmt.eqiad.wmnet with reboot 
policy FORCED [19:47:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1304.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1302.mgmt.eqiad.wmnet with reboot policy FORCED [19:47:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1301.mgmt.eqiad.wmnet with reboot policy FORCED [19:49:20] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1297.eqiad.wmnet with OS bullseye [19:49:33] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054805 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1297.eqiad.wmnet with OS bull... [19:49:36] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1304.eqiad.wmnet with OS bullseye [19:49:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054806 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1304.eqiad.wmnet with OS bull... [19:49:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1299.eqiad.wmnet with OS bullseye [19:49:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054807 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1299.eqiad.wmnet with OS bull... 
[19:50:38] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1302.eqiad.wmnet with OS bullseye [19:50:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054808 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1302.eqiad.wmnet with OS bull... [19:50:59] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1303.mgmt.eqiad.wmnet with reboot policy FORCED [19:51:11] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1301.eqiad.wmnet with OS bullseye [19:51:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054809 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1301.eqiad.wmnet with OS bull... [19:52:02] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1303.eqiad.wmnet with OS bullseye [19:52:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054819 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1303.eqiad.wmnet with OS bull... 
[19:52:36] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1300.mgmt.eqiad.wmnet with reboot policy FORCED [19:52:52] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1300.eqiad.wmnet with OS bullseye [19:53:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054825 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1300.eqiad.wmnet with OS bull... [19:54:18] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [20:03:38] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1298.mgmt.eqiad.wmnet with reboot policy FORCED [20:06:17] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1297.eqiad.wmnet with reason: host reimage [20:08:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1297.eqiad.wmnet with reason: host reimage [20:19:39] (03PS1) 10NMW03: Set wgAutoConfirmCount to 10 for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1061101 (https://phabricator.wikimedia.org/T372172) [20:23:29] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:23:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [20:23:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1297.eqiad.wmnet with OS bullseye [20:23:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install 
wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1297.eqiad.wmnet with OS bullseye... [20:25:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054869 (10Jclark-ctr) [20:30:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [20:50:40] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-blazegraph-exporter-wdqs-blazegraph.service on wdqs2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054913 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1304.eqiad.wmnet with OS bullseye... 
[20:53:48] FIRING: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:03:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs1024:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [21:07:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054961 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1301.eqiad.wmnet with OS bullseye... [21:07:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054962 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1303.eqiad.wmnet with OS bullseye... [21:07:39] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054963 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1300.eqiad.wmnet with OS bullseye... [21:07:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054964 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1302.eqiad.wmnet with OS bullseye... 
[21:09:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1299.eqiad.wmnet with OS bullseye [21:09:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054965 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1299.eqiad.wmnet with OS bull... [21:09:53] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1299.eqiad.wmnet with OS bullseye [21:09:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1299.eqiad.wmnet with OS bullseye... [21:11:46] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1299.eqiad.wmnet with OS bullseye [21:11:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1299.eqiad.wmnet with OS bull... [21:15:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker1266.eqiad.wmnet with OS bullseye [21:16:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054968 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1266.eqiad.wmnet with OS bull... 
[21:18:29] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker1266.eqiad.wmnet with reason: host reimage
[21:21:45] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker1266.eqiad.wmnet with reason: host reimage
[21:29:43] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[21:30:37] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[21:30:38] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker1266.eqiad.wmnet with OS bullseye
[21:30:44] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054975 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1266.eqiad.wmnet with OS bullseye...
[21:33:40] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054978 (Jclark-ctr) @Jhancock.wm @VRiley-WMF I will be on vacation next week 1300-1304 will need to be imaged with no-pxe flag after site.pp file i...
[21:33:43] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054980 (Jclark-ctr)
[21:38:32] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10054993 (Jclark-ctr) 1299 will also need some attention @Jhancock.wm @VRiley-WMF
[21:40:45] ops-eqiad, SRE, Data-Platform, DC-Ops: Q1:rack/setup/install an-presto10[16-20] - https://phabricator.wikimedia.org/T370543#10054996 (Jclark-ctr) @Jhancock.wm @VRiley-WMF these have been provisioned but need Raid configured and imaged once site.pp file is fixed
[21:43:32] ops-eqiad, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10055004 (Jclark-ctr)
[21:44:18] ops-eqiad, SRE, collaboration-services, DC-Ops: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10055000 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[21:47:31] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:17:31] FIRING: [7x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:19:40] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:32:01] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wikikube-worker1299.eqiad.wmnet with OS bullseye
[22:32:09] ops-eqiad, SRE, DC-Ops, serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10055019 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1299.eqiad.wmnet with OS bullseye...
[23:07:31] FIRING: [6x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:12:31] FIRING: [5x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:17:31] FIRING: [5x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:38:45] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1061110
[23:38:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1061110 (owner: TrainBranchBot)
[23:42:31] FIRING: [4x] ProbeDown: Service wdqs1022:443 has failed probes (http_wdqs_main_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown