[00:06:32] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1079368 (owner: 10TrainBranchBot) [00:38:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P69665 and previous config saved to /var/cache/conftool/dbconfig/20241011-003840-arnaudb.json [00:38:44] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [00:48:11] 06SRE, 06Content-Transform-Team-WIP, 10MW-on-K8s, 06serviceops, and 4 others: A lot of `[info] Wikitext for this page has duplicate ids:` in logstash for mw-parsoid. Possibly related to PageBundle - https://phabricator.wikimedia.org/T358588#10220378 (10ABreault-WMF) a:03ABreault-WMF [00:53:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P69666 and previous config saved to /var/cache/conftool/dbconfig/20241011-005347-arnaudb.json [01:04:41] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10220391 (10phaultfinder) [01:05:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [01:08:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P69667 and previous config saved to /var/cache/conftool/dbconfig/20241011-010854-arnaudb.json [01:20:56] FIRING: RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: ... 
[01:20:56] Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [01:24:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2219 (T367781)', diff saved to https://phabricator.wikimedia.org/P69668 and previous config saved to /var/cache/conftool/dbconfig/20241011-012401-arnaudb.json [01:24:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on db2237.codfw.wmnet with reason: Maintenance [01:24:05] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [01:24:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2237.codfw.wmnet with reason: Maintenance [01:24:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2237 (T367781)', diff saved to https://phabricator.wikimedia.org/P69669 and previous config saved to /var/cache/conftool/dbconfig/20241011-012424-arnaudb.json [01:25:56] RESOLVED: [2x] RdfStreamingUpdaterFlinkProcessingLatencyIsHigh: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [01:26:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T367781)', diff saved to https://phabricator.wikimedia.org/P69670 and previous config saved to /var/cache/conftool/dbconfig/20241011-012635-arnaudb.json [01:41:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to 
https://phabricator.wikimedia.org/P69671 and previous config saved to /var/cache/conftool/dbconfig/20241011-014142-arnaudb.json [01:42:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:56:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P69672 and previous config saved to /var/cache/conftool/dbconfig/20241011-015649-arnaudb.json [02:11:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2237 (T367781)', diff saved to https://phabricator.wikimedia.org/P69673 and previous config saved to /var/cache/conftool/dbconfig/20241011-021156-arnaudb.json [02:12:00] T367781: Drop deprecated abuse filter fields on wmf wikis - https://phabricator.wikimedia.org/T367781 [02:37:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:57:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:19:50] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10220468 (10phaultfinder) [03:49:50] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10220471 (10phaultfinder) [03:57:04] (03PS1) 10Krinkle: Revert "codesearch: add ports for design and discovery" [puppet] - 10https://gerrit.wikimedia.org/r/1079382 [04:39:42] 
10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10220481 (10phaultfinder) [05:05:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241011T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1006:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:45] (03CR) 10Ayounsi: [C:03+1] Remove pfw3 and add pfw1 [puppet] - 10https://gerrit.wikimedia.org/r/1079364 (https://phabricator.wikimedia.org/T374176) (owner: 10Papaul) [06:40:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1079151 (https://phabricator.wikimedia.org/T376871) (owner: 10Slyngshede) [06:41:39] (03PS5) 10Muehlenhoff: peopleweb: limit envoy srange to CACHES and DEPLOYMENT servers [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [06:41:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) 
[06:48:56] (03CR) 10Slyngshede: [C:03+2] R:idmcloud remove role, service will be moved to a different role. [puppet] - 10https://gerrit.wikimedia.org/r/1079151 (https://phabricator.wikimedia.org/T376871) (owner: 10Slyngshede) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241011T0700) [07:01:01] (03PS1) 10Muehlenhoff: tlsproxy::envoy: Simplify firewall rule set [puppet] - 10https://gerrit.wikimedia.org/r/1079395 [07:13:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079395 (owner: 10Muehlenhoff) [07:32:26] (03CR) 10Jon Harald Søby: "The Palatine German Wikipedia (pflwiki) has sister-projects-in-Wikipedia for lots of projects, so it might be a good idea to add them to t" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: 10Pppery) [07:34:59] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PrometheusMysqldExporterFailed (instance dbstore1009:13350) - https://phabricator.wikimedia.org/T376977 (10LSobanski) 03NEW [07:35:29] 07sre-alert-triage, 06DBA: Alert in need of triage: PrometheusMysqldExporterFailed (instance db1208:13351) - https://phabricator.wikimedia.org/T376978 (10LSobanski) 03NEW [07:37:39] (03CR) 10Elukey: [C:03+2] profile::docker::reporter: avoid OCI indexes for k8s images [puppet] - 10https://gerrit.wikimedia.org/r/1079291 (owner: 10Elukey) [07:50:39] (03CR) 10Brouberol: "You also need to add the following block to `hieradata/role/common/cache/text.yaml`, to make sure the assets are not cached by the CDN, an" [puppet] - 10https://gerrit.wikimedia.org/r/1079361 (https://phabricator.wikimedia.org/T374948) (owner: 10Bking) [07:53:12] (03PS1) 10Giuseppe Lavagetto: mediawiki: add mw.name helper [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079443 [07:53:14] (03PS1) 10Giuseppe Lavagetto: mw-script: deduplicate 
resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079445 (https://phabricator.wikimedia.org/T376795) [07:54:10] (03CR) 10CI reject: [V:04-1] mediawiki: deduplicate network policies and configmaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079444 (https://phabricator.wikimedia.org/T376795) (owner: 10Giuseppe Lavagetto) [07:55:01] (03CR) 10CI reject: [V:04-1] mw-script: deduplicate resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079445 (https://phabricator.wikimedia.org/T376795) (owner: 10Giuseppe Lavagetto) [07:56:09] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1177.eqiad.wmnet with OS bullseye [07:59:00] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1177.eqiad.wmnet with OS bullseye [08:00:04] !log upload ircstream 0.13.0+wmf12u2 to apt.wikimedia.org (sync to latest git and the async_broadcast feature branch) T376014 [08:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:19] T376014: Create and deploy a re-reimplementation of irc.wikimedia.org in Python 3 without external service deps - https://phabricator.wikimedia.org/T376014 [08:02:47] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [08:04:39] (03CR) 10Muehlenhoff: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079395 should fix this" [puppet] - 10https://gerrit.wikimedia.org/r/1071926 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [08:04:55] (03CR) 10Muehlenhoff: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1079395 should fix this" [puppet] - 10https://gerrit.wikimedia.org/r/1071927 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [08:10:14] !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [08:10:15] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
an-worker1177.eqiad.wmnet with OS bullseye [08:12:53] !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [08:16:57] (03CR) 10Brouberol: [C:03+2] Define a ceph rolling restart/reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1078959 (https://phabricator.wikimedia.org/T375071) (owner: 10Brouberol) [08:17:50] !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [08:19:08] (03CR) 10Giuseppe Lavagetto: [C:03+2] fastapi: Add define to run a fastapi application [puppet] - 10https://gerrit.wikimedia.org/r/1078708 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [08:19:25] (03CR) 10Giuseppe Lavagetto: [C:03+2] profile::conftool: add web interface for requestctl [puppet] - 10https://gerrit.wikimedia.org/r/1078709 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [08:19:44] !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [08:19:54] (03CR) 10Giuseppe Lavagetto: [C:03+2] hiddenparma: add to deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1078983 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [08:38:06] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: don't enable sysctl rp_filter for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) [08:44:43] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10220686 (10phaultfinder) [08:47:24] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [08:59:31] (03PS2) 10Btullis: cephosd: Switch to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) [09:00:41] (03PS3) 10Btullis: cephosd: Switch to use nftables instead of iptables [puppet] - 
10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) [09:01:35] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4280/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:02:04] (03PS4) 10Btullis: cephosd: Switch to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) [09:02:49] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4281/co" [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:04:10] (03PS1) 10Klausman: admin/home/klausman: add no_proxy to bashrc functions [puppet] - 10https://gerrit.wikimedia.org/r/1079450 [09:05:22] (03CR) 10FNegri: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [09:05:26] FIRING: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [09:06:03] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] wmcs: declare prometheus::node_kernel_panic in profile::base::cloud_production [puppet] - 10https://gerrit.wikimedia.org/r/1078954 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [09:13:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T367856)', diff saved to https://phabricator.wikimedia.org/P69674 and previous config saved to /var/cache/conftool/dbconfig/20241011-091305-ladsgroup.json [09:13:09] T367856: Cleanup 
revision table schema - https://phabricator.wikimedia.org/T367856 [09:18:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:58] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1176.eqiad.wmnet with OS bullseye [09:19:44] It appears that gerrit.wikimedia.org is down again [09:19:53] indeed [09:20:04] stopped me from reviewing a patch :D [09:20:17] Back up for me [09:20:30] now it works [09:23:21] (03CR) 10Brouberol: [C:03+1] cephosd: Switch to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:23:23] (03PS1) 10Giuseppe Lavagetto: Tweak deploy makefile, artifacts [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1079451 [09:23:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:24:17] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: don't enable sysctl rp_filter for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) [09:25:26] RESOLVED: RoutinatorRTRConnections: Important drop of Routinator RTR connections on rpki2003:9556 - https://wikitech.wikimedia.org/wiki/RPKI#RTR_Connections_drop - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRTRConnections [09:26:10] I'll check whats going on with gerrit [09:26:11] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Tweak deploy makefile, 
artifacts [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1079451 (owner: 10Giuseppe Lavagetto) [09:26:22] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: fix IPv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) [09:26:24] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add scap files [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1079452 (owner: 10Giuseppe Lavagetto) [09:26:44] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add first version to deploy of hiddenparma [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1078435 (https://phabricator.wikimedia.org/T371782) (owner: 10Giuseppe Lavagetto) [09:27:07] !log Restarted MediaModeration scanning script - https://wikitech.wikimedia.org/wiki/MediaModeration [09:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P69675 and previous config saved to /var/cache/conftool/dbconfig/20241011-092812-ladsgroup.json [09:29:22] (03CR) 10Ayounsi: [C:03+2] sre.hosts.dhcp: add --force-dhcp-tftp [cookbooks] - 10https://gerrit.wikimedia.org/r/1079224 (owner: 10Ayounsi) [09:30:42] (03PS1) 10Alexandros Kosiaris: thanos: Add a recording rule for PHP FPM workers [puppet] - 10https://gerrit.wikimedia.org/r/1079453 [09:31:01] (03CR) 10Gmodena: "LGTM! left you a question and two nits." [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [09:31:45] (03CR) 10Gmodena: "@otto @joal: tagging you folks since we recently discussed this schema definition in slack." 
[puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [09:35:09] (03CR) 10Btullis: [V:03+1 C:03+2] cephosd: Switch to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1050331 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [09:35:15] (03CR) 10Ayounsi: "Not sure if I fully understand. As long as there are contributing more specific prefixes, the aggregate should become "active" and be the " [homer/public] - 10https://gerrit.wikimedia.org/r/1079288 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [09:35:46] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove scandium from puppet [puppet] - 10https://gerrit.wikimedia.org/r/1024402 (https://phabricator.wikimedia.org/T376632) (owner: 10Alexandros Kosiaris) [09:36:28] (03CR) 10Cathal Mooney: cloudgw: fix IPv6 settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [09:38:43] !log akosiaris@cumin1002 START - Cookbook sre.hosts.decommission for hosts scandium.eqiad.wmnet [09:38:58] 06SRE, 06Infrastructure-Foundations, 06serviceops: Clean up the Docker Registry catalog and Swift storage from old images - https://phabricator.wikimedia.org/T375645#10220815 (10elukey) I've created a Python script to dry-run what I highlighted above, this is how it would look like: ==== No tags, registryct... 
[09:40:49] (03CR) 10Klausman: [C:03+2] admin/home/klausman: add no_proxy to bashrc functions [puppet] - 10https://gerrit.wikimedia.org/r/1079450 (owner: 10Klausman) [09:41:00] !log brouberol@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd [09:41:45] (03PS1) 10Ayounsi: Routed ganeti: fix incorrect v6 forwarding keyword [puppet] - 10https://gerrit.wikimedia.org/r/1079455 [09:41:58] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: fix IPv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) [09:42:23] (03CR) 10Arturo Borrero Gonzalez: cloudgw: fix IPv6 settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [09:42:25] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [09:42:32] (03Merged) 10jenkins-bot: sre.hosts.dhcp: add --force-dhcp-tftp [cookbooks] - 10https://gerrit.wikimedia.org/r/1079224 (owner: 10Ayounsi) [09:43:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P69676 and previous config saved to /var/cache/conftool/dbconfig/20241011-094319-ladsgroup.json [09:43:36] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [09:43:47] (03CR) 10Fabfur: Renamed log fields for pipeline migration (haproxykafka) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [09:44:42] (03PS3) 10Fabfur: Renamed log fields for pipeline migration (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) [09:45:43] (03CR) 10Fabfur: "Related changes in https://gitlab.wikimedia.org/repos/sre/haproxykafka/-/merge_requests/68" [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [09:51:00] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: fix IPv6 settings [puppet] - 10https://gerrit.wikimedia.org/r/1079449 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [09:52:09] (03CR) 10Gmodena: Renamed log fields for pipeline migration (haproxykafka) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [09:54:48] FIRING: PuppetFailure: Puppet has failed on deploy2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:56:08] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1079455 (owner: 10Ayounsi) [09:58:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T367856)', diff saved to https://phabricator.wikimedia.org/P69677 and previous config saved to /var/cache/conftool/dbconfig/20241011-095826-ladsgroup.json [09:58:28] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance [09:58:30] T367856: Cleanup revision table schema - 
https://phabricator.wikimedia.org/T367856 [09:58:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1211.eqiad.wmnet with reason: Maintenance [09:58:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T367856)', diff saved to https://phabricator.wikimedia.org/P69678 and previous config saved to /var/cache/conftool/dbconfig/20241011-095847-ladsgroup.json [10:02:26] (03PS25) 10Arnaudb: mariadb: clone cookbook maintenance [cookbooks] - 10https://gerrit.wikimedia.org/r/1071155 (https://phabricator.wikimedia.org/T374191) [10:05:49] (03PS1) 10JMeybohm: Add missing test for CalicoKubeControllersDown [alerts] - 10https://gerrit.wikimedia.org/r/1079456 [10:05:49] (03PS1) 10JMeybohm: Add CalicoHighMemoryUsage alert [alerts] - 10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) [10:07:38] (03CR) 10CI reject: [V:04-1] Add CalicoHighMemoryUsage alert [alerts] - 10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [10:12:03] (03PS5) 10Tiziano Fogli: icinga: disable shard check logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) [10:12:44] (03PS1) 10JMeybohm: Remove memory limits from calico components in wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079459 (https://phabricator.wikimedia.org/T376976) [10:14:48] FIRING: [2x] PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:15:46] (03PS2) 10JMeybohm: Add CalicoHighMemoryUsage alert [alerts] - 10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) [10:15:51] (03CR) 10CI reject: [V:04-1] Remove memory limits from calico components in wikikube clusters [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1079459 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [10:19:20] (03CR) 10Tiziano Fogli: [C:03+2] icinga: disable shard check logstash cluster [puppet] - 10https://gerrit.wikimedia.org/r/1063986 (https://phabricator.wikimedia.org/T371083) (owner: 10Tiziano Fogli) [10:20:31] (03PS2) 10JMeybohm: Remove memory limits from calico components in wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079459 (https://phabricator.wikimedia.org/T376976) [10:23:44] !log cgoubert@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2092.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [10:24:18] !incidents [10:24:18] 5310 (UNACKED) db2175 (paged)/MariaDB Replica SQL: s2 (paged) [10:24:19] 5309 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [10:24:29] checking [10:24:55] !ack 5310 [10:24:55] 5310 (ACKED) db2175 (paged)/MariaDB Replica SQL: s2 (paged) [10:25:14] should not be an impactful issue [10:25:21] it is a regular replica [10:25:58] okay [10:25:58] I'm a bit surprised it did not pa.ge here in irc but the bot is probably down [10:26:27] !log disabling puppet on R:acme_chief::cert for T376800 [10:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:46] it was pooled, depooling [10:27:06] !log jynus@cumin1002 dbctl commit (dc=all): 'depool db2175', diff saved to https://phabricator.wikimedia.org/P69680 and previous config saved to /var/cache/conftool/dbconfig/20241011-102706-jynus.json [10:27:21] will see if it is a load issue or an instance issue [10:27:34] thanks for acking, it didn't show up on VO for me [10:27:35] load dropped to 0% [10:27:48] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db2175&var-datasource=thanos&var-cluster=mysql [10:28:16] mw automatically depools it, it is just the checks that create
normally log spam, and flopping if not depooled [10:28:43] the thing is, will it make the others bad, or will it fix [10:28:49] (03CR) 10Kamila Součková: [C:03+1] "LGTM but see inline" [alerts] - 10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [10:29:44] megacli: command not found [10:30:17] that's weird [10:30:22] (03PS15) 10Tiziano Fogli: atlas: adding prometheus blackbox icmp checks [puppet] - 10https://gerrit.wikimedia.org/r/1079226 (https://phabricator.wikimedia.org/T370506) [10:30:50] that could point to a bad install - no raid tools, maybe learning cycle active? [10:31:17] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2092.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTARTand with Dell SCP reboot policy GRACEFUL [10:31:25] oh, it says not replicating, maybe it crashed? [10:31:41] I see, index corrupt [10:31:44] will file a ticket [10:32:04] (03CR) 10Kamila Součková: [C:03+1] "I'm not 100% happy with this because now we're relying on alerts which may be hidden in the alert noise, but that's not a problem with thi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079459 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [10:32:19] jelto: I think no more action needed, will file a ticket and repair the index by rebuilding it, CC arnaudb [10:32:21] !incidents [10:32:22] 5310 (ACKED) db2175 (paged)/MariaDB Replica SQL: s2 (paged) [10:32:22] 5311 (UNACKED) db2175 (paged)/MariaDB Replica Lag: s2 (paged) [10:32:22] 5309 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [10:32:28] it pag.ed again [10:32:33] I will downtime it [10:32:34] !ack 5311 [10:32:35] 5311 (ACKED) db2175 (paged)/MariaDB Replica Lag: s2 (paged) [10:32:45] it will take a while to be fixed [10:33:08] (03CR) 10Kamila Součková: [C:03+1] Add CalicoHighMemoryUsage alert (031 comment) [alerts] - 
10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [10:33:13] ack so depooled and downtime ? [10:33:32] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2175.codfw.wmnet with reason: index corruption [10:33:41] will show you what I ran [10:33:47] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2175.codfw.wmnet with reason: index corruption [10:33:54] dbctl instance db2175 depool [10:34:01] dbctl config commit -m "depool db2175" [10:34:13] then: cookbook sre.hosts.downtime --days 3 -r "index corruption" db2175.codfw.wmet [10:34:25] cookbook sre.hosts.downtime --days 3 -r "index corruption" db2175.codfw.wmnet [10:34:36] that makes sense, thanks for the explanation :) [10:34:55] !log disabled puppet on acmechief1002 (T376800) [10:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:08] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-worker2092.codfw.wmnet with OS bullseye [10:35:20] (03CR) 10Tiziano Fogli: [C:03+2] atlas: adding prometheus blackbox icmp checks [puppet] - 10https://gerrit.wikimedia.org/r/1079226 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [10:35:21] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10220912 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host wikikube-worker2092.codfw.wmnet with OS bullseye [10:35:31] jelto: those still p* because the automatic depool is not perfect [10:36:17] but when they get behind they already get thrown out of mw logic, just they are very spammy and sometimes depool is not the right fix (e.g. 
if it is high load) [10:36:33] so usually those case very little impact [10:36:38] *cause [10:36:56] but sometimes they do, so we keep them p*ging [10:37:07] ok :) [10:37:07] Is it ok to leave the vops incidents open/acknowledged or will this pa.ge again during the weekend? With a downtime of 3 days it should fire again on Monday right? [10:37:22] yeah, I will try to get it fixed today [10:37:23] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host acmechief1002.eqiad.wmnet [10:37:41] but will take care of all the followups, including filing a ticket [10:37:52] great thanks a lot [10:37:59] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief1002.eqiad.wmnet [10:37:59] technically this would fall in the dba's responsibility [10:38:08] but I will do it because I know how :-D [10:39:08] but in this case, even if we ignored it, no user impact (or very little) [10:39:34] the issue is when we depool it and it is mw heavy querying, in which case depooling only makes the problem worse, moving it around [10:39:47] hence the p*ging and the human intervention, if that makes sense [10:40:02] ideally, with better global automation, that can be less manual in the future [10:40:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:41:04] That makes sense :) [10:41:31] I think normally depooling and repooling if it doesn't work is a good blind option [10:43:40] jelto: https://phabricator.wikimedia.org/T376988 [10:43:54] i just saw the page [10:44:04] thanks for filing the task [10:44:18] !log rebooting acmechief1002|2002 (sequentially) (T376800) [10:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:29] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host acmechief2002.codfw.wmnet [10:45:00] thanks for handling the depool 🙏 
[10:45:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [10:45:30] 'Index for table 'recentchanges' I'll deal with it after lunch [10:46:02] I already started the alter [10:46:09] will handover so you can do it [10:46:09] amazing [10:46:23] ack, no problem :) [10:46:28] will do! [10:47:06] will claim until done, let the rest be handled by you [10:47:10] (03CR) 10JMeybohm: Add CalicoHighMemoryUsage alert (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [10:47:38] !log fabfur@cumin1002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host acmechief2002.codfw.wmnet [10:49:52] jelto: I thought it was going to take more, the fix took only 10 seconds, should resolve soon [10:50:04] (03PS3) 10JMeybohm: Add CalicoHighMemoryUsage alert [alerts] - 10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) [10:50:05] !log cgoubert@cumin1002 START - Cookbook sre.dns.netbox [10:50:13] nice, it just resolved in victor ops [10:50:27] !log enabled puppet on R:acme_chief::cert for T376800 [10:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:41] !log brouberol@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd [10:50:43] the problem is this is happening lately more frequently [10:50:44] the duplicate 5311 is still open, I'll wait a bit if it resolves as well otherwise I'll close it [10:50:58] !incidents [10:50:58] 5311 (RESOLVED) db2175 (paged)/MariaDB Replica Lag: s2 (paged) [10:50:58] 5310 (RESOLVED) db2175 (paged)/MariaDB Replica SQL: s2 (paged) [10:50:59] 5309 (RESOLVED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [10:50:59] it just did [10:51:02] great [10:51:23] 
(03CR) 10Gmodena: Renamed log fields for pipeline migration (haproxykafka) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [10:53:13] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host acmechief-test2001.codfw.wmnet [10:55:08] I'd sign off if all incidents were as smooth as those ones. I suggest not needing an incident dock because there would be hardly any user impact [10:55:14] *doc [10:55:34] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-worker2092.codfw.wmnet with reason: host reimage [10:55:37] however, I would like to see a more long term solution to the index corruption, either upstream or by our automation [10:56:46] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:56:59] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test2001.codfw.wmnet [10:57:24] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host acmechief-test1001.eqiad.wmnet [10:58:52] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-worker2092.codfw.wmnet with reason: host reimage [10:59:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T367856)', diff saved to https://phabricator.wikimedia.org/P69682 and previous config saved to /var/cache/conftool/dbconfig/20241011-105903-ladsgroup.json [10:59:07] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241011T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) GitLab version upgrades deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20241011T1100). [11:02:55] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host acmechief-test1001.eqiad.wmnet [11:05:11] (03PS4) 10Fabfur: Renamed log fields for pipeline migration (haproxykafka) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) [11:05:54] (03PS1) 10Jelto: miscweb: add support to mount optional confimaps [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079465 (https://phabricator.wikimedia.org/T350793) [11:05:56] (03PS1) 10Jelto: wikidata-query-gui: mount custom-config.json into pod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079466 (https://phabricator.wikimedia.org/T350793) [11:09:19] (03PS1) 10Giuseppe Lavagetto: deployment_server: add keyholder config for requestctl web UI [puppet] - 10https://gerrit.wikimedia.org/r/1079469 [11:10:19] (03CR) 10Giuseppe Lavagetto: [C:03+2] deployment_server: add keyholder config for requestctl web UI [puppet] - 10https://gerrit.wikimedia.org/r/1079469 (owner: 10Giuseppe Lavagetto) [11:11:48] (03CR) 10Fabfur: Renamed log fields for pipeline migration (haproxykafka) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: 10Fabfur) [11:14:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P69683 and previous config saved to /var/cache/conftool/dbconfig/20241011-111410-ladsgroup.json [11:18:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [11:20:01] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-worker2092.codfw.wmnet with OS bullseye [11:20:12] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have 
unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10221004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikikube-worker2092.codfw.wmnet with OS bullseye com... [11:21:26] FIRING: [6x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:23:34] (03PS1) 10Slyngshede: Account blocking: Publically available log of all block and unblocks. [software/bitu] - 10https://gerrit.wikimedia.org/r/1079470 (https://phabricator.wikimedia.org/T376991) [11:26:28] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node pool for host wikikube-worker2092.codfw.wmnet [11:26:30] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool for host wikikube-worker2092.codfw.wmnet [11:27:15] !log cgoubert@cumin1002 START - Cookbook sre.hosts.remove-downtime for wikikube-worker2092.codfw.wmnet [11:27:15] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for wikikube-worker2092.codfw.wmnet [11:28:28] FIRING: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [11:28:39] (03CR) 10JMeybohm: [C:03+2] Add CalicoHighMemoryUsage alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [11:28:41] (03CR) 10JMeybohm: [C:03+2] Add missing test for CalicoKubeControllersDown [alerts] - 10https://gerrit.wikimedia.org/r/1079456 (owner: 10JMeybohm) [11:28:49] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - 
https://phabricator.wikimedia.org/T358489#10221059 (10Clement_Goubert) [11:29:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P69684 and previous config saved to /var/cache/conftool/dbconfig/20241011-112917-ladsgroup.json [11:29:37] 06SRE, 06serviceops, 13Patch-For-Review: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#10221064 (10Clement_Goubert) 05Open→03In progress [11:29:52] (03Merged) 10jenkins-bot: Add missing test for CalicoKubeControllersDown [alerts] - 10https://gerrit.wikimedia.org/r/1079456 (owner: 10JMeybohm) [11:29:56] 10ops-codfw, 06DC-Ops, 10Prod-Kubernetes, 06serviceops: Degraded RAID on wikikube-worker2092 - https://phabricator.wikimedia.org/T374409#10221060 (10Clement_Goubert) 05In progress→03Resolved Hardware RAID removed, server reimaged and repooled. [11:30:07] (03CR) 10JMeybohm: [C:03+2] Remove memory limits from calico components in wikikube clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079459 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [11:30:34] (03Merged) 10jenkins-bot: Add CalicoHighMemoryUsage alert [alerts] - 10https://gerrit.wikimedia.org/r/1079457 (https://phabricator.wikimedia.org/T376976) (owner: 10JMeybohm) [11:32:59] (03PS1) 10Hashar: ci: fix git mirror not fetching branches [puppet] - 10https://gerrit.wikimedia.org/r/1079472 (https://phabricator.wikimedia.org/T376981) [11:33:28] RESOLVED: [2x] KeyholderUnarmed: 1 unarmed Keyholder key(s) on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [11:36:38] !log akosiaris@cumin1002 START - Cookbook sre.dns.netbox [11:39:47] (03PS1) 10Cyndywikime: Update stream configuration to capture user id [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079475 (https://phabricator.wikimedia.org/T376833) [11:39:56] (03CR) 
10Ayounsi: [C:03+2] Routed ganeti: fix incorrect v6 forwarding keyword [puppet] - 10https://gerrit.wikimedia.org/r/1079455 (owner: 10Ayounsi) [11:42:11] (03CR) 10Hashar: "I have applied the patch on the integration Puppet master and I have confirmed it now fetches both branches and tags :-] That will update" [puppet] - 10https://gerrit.wikimedia.org/r/1079472 (https://phabricator.wikimedia.org/T376981) (owner: 10Hashar) [11:42:17] (03CR) 10Hashar: [C:03+1] ci: fix git mirror not fetching branches [puppet] - 10https://gerrit.wikimedia.org/r/1079472 (https://phabricator.wikimedia.org/T376981) (owner: 10Hashar) [11:44:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T367856)', diff saved to https://phabricator.wikimedia.org/P69685 and previous config saved to /var/cache/conftool/dbconfig/20241011-114424-ladsgroup.json [11:44:26] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [11:44:28] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [11:44:39] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db1214.eqiad.wmnet with reason: Maintenance [11:44:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T367856)', diff saved to https://phabricator.wikimedia.org/P69686 and previous config saved to /var/cache/conftool/dbconfig/20241011-114446-ladsgroup.json [11:51:26] FIRING: [12x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:54:26] (03PS1) 10Arturo Borrero Gonzalez: prometheus-node-kernel-panic.sh: account for buster servers [puppet] - 10https://gerrit.wikimedia.org/r/1079480 
(https://phabricator.wikimedia.org/T376719) [12:08:32] (03PS1) 10Btullis: Omit the hdfs_file test that uses a puppet:/// source [puppet] - 10https://gerrit.wikimedia.org/r/1079481 (https://phabricator.wikimedia.org/T323692) [12:12:22] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10221196 (10ayounsi) A few more reasons to upgrade in {T376986}. [12:13:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 1%: T376988', diff saved to https://phabricator.wikimedia.org/P69687 and previous config saved to /var/cache/conftool/dbconfig/20241011-121325-arnaudb.json [12:13:29] T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. - https://phabricator.wikimedia.org/T376988 [12:17:03] (03CR) 10FNegri: [C:03+1] "Checked on a buster host, seems to work!" [puppet] - 10https://gerrit.wikimedia.org/r/1079480 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [12:17:52] (03CR) 10Btullis: [C:03+2] Omit the hdfs_file test that uses a puppet:/// source [puppet] - 10https://gerrit.wikimedia.org/r/1079481 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [12:18:13] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] prometheus-node-kernel-panic.sh: account for buster servers [puppet] - 10https://gerrit.wikimedia.org/r/1079480 (https://phabricator.wikimedia.org/T376719) (owner: 10Arturo Borrero Gonzalez) [12:28:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 2%: T376988', diff saved to https://phabricator.wikimedia.org/P69688 and previous config saved to /var/cache/conftool/dbconfig/20241011-122830-arnaudb.json [12:28:34] T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. 
- https://phabricator.wikimedia.org/T376988 [12:30:58] (03PS1) 10Ayounsi: BGP_Customer_out: set No_smallnet6 to /49 and below [homer/public] - 10https://gerrit.wikimedia.org/r/1079487 [12:31:30] 07sre-alert-triage, 06DBA: Alert in need of triage: PrometheusMysqldExporterFailed (instance db1208:13351) - https://phabricator.wikimedia.org/T376978#10221241 (10ABran-WMF) p:05Triage→03Medium this falls under T371049 I think: https://phabricator.wikimedia.org/T371049#10021371 [12:31:42] 07sre-alert-triage, 06DBA: Alert in need of triage: PrometheusMysqldExporterFailed (instance db1208:13351) - https://phabricator.wikimedia.org/T376978#10221247 (10ABran-WMF) [12:33:34] hi, I have to restart Gerrit to clear out some cache and rediscover some repos [12:33:40] that will be for some minutes [12:34:08] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: scandium.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1002" [12:34:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: scandium.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - akosiaris@cumin1002" [12:34:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:34:32] !log akosiaris@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts scandium.eqiad.wmnet [12:36:43] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission scandium - https://phabricator.wikimedia.org/T376632#10221261 (10akosiaris) [12:37:13] !log Restarting Gerrit [12:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Q1:rack/setup/install backup1012 - https://phabricator.wikimedia.org/T371416#10221263 (10elukey) @jcrespo the host is 
up and reimaged, but the mgmt interface is not reachable.. if you want to start configuring the host go ahead,... [12:38:02] Thanks for the info. [12:38:10] gerrit.service: Consumed 3month 4d 20h 22min 46.096s CPU time. [12:38:14] pour CPUs [12:38:17] wowee [12:38:33] how long since the last restart? [12:38:59] 2 months wall-clock time? https://sal.toolforge.org/log/gxO5MJEBKFqumxvtqUcU [12:40:04] I don't know [12:40:48] It appears back up for me now :) [12:42:27] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) [12:43:03] (03CR) 10CI reject: [V:04-1] cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [12:43:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 5%: T376988', diff saved to https://phabricator.wikimedia.org/P69690 and previous config saved to /var/cache/conftool/dbconfig/20241011-124336-arnaudb.json [12:43:40] T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. - https://phabricator.wikimedia.org/T376988 [12:44:08] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) [12:44:48] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [12:45:04] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission scandium - https://phabricator.wikimedia.org/T376632#10221277 (10akosiaris) @ssastry, @Arlolra, fyi scandium is no more, may it RIP. 
[12:46:39] (03PS1) 10Giuseppe Lavagetto: Add dsh group to allow scap deploy --init to work [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1079510 [12:46:52] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Add dsh group to allow scap deploy --init to work [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1079510 (owner: 10Giuseppe Lavagetto) [12:48:39] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) [12:48:49] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [12:49:37] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10221296 (10phaultfinder) [12:56:06] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) [12:58:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 10%: T376988', diff saved to https://phabricator.wikimedia.org/P69691 and previous config saved to /var/cache/conftool/dbconfig/20241011-125841-arnaudb.json [12:58:45] T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. 
- https://phabricator.wikimedia.org/T376988 [13:04:48] FIRING: [2x] PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:08:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [13:11:42] (03PS2) 10Arnaudb: mysql_legacy: double quote escape in run_query [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) [13:12:10] (03CR) 10Arnaudb: "I've added a tiny new thing" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078658 (https://phabricator.wikimedia.org/T376712) (owner: 10Arnaudb) [13:12:37] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "renamed k8s prefixes descriptions in Netbox - ayounsi@cumin1002" [13:13:08] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "renamed k8s prefixes descriptions in Netbox - ayounsi@cumin1002" [13:13:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: T376988', diff saved to https://phabricator.wikimedia.org/P69692 and previous config saved to /var/cache/conftool/dbconfig/20241011-131347-arnaudb.json [13:13:55] T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. 
- https://phabricator.wikimedia.org/T376988 [13:17:31] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: refresh forwarding firewall to accomodate for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1079515 (https://phabricator.wikimedia.org/T374716) [13:17:47] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: refresh forwarding firewall to accomodate for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1079515 (https://phabricator.wikimedia.org/T374716) [13:22:40] (03CR) 10Papaul: [C:03+2] Remove pfw3 and add pfw1 [puppet] - 10https://gerrit.wikimedia.org/r/1079364 (https://phabricator.wikimedia.org/T374176) (owner: 10Papaul) [13:24:48] RESOLVED: PuppetFailure: Puppet has failed on deploy1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:25:46] (03PS1) 10Daimona Eaytoy: [uawikimedia] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079518 (https://phabricator.wikimedia.org/T376695) [13:26:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, October 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079518 (https://phabricator.wikimedia.org/T376695) (owner: 10Daimona Eaytoy) [13:28:37] (03PS2) 10Papaul: Add new frack switches for monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/1075624 (https://phabricator.wikimedia.org/T374587) [13:28:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: T376988', diff saved to https://phabricator.wikimedia.org/P69693 and previous config saved to /var/cache/conftool/dbconfig/20241011-132852-arnaudb.json [13:28:59] T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. 
- https://phabricator.wikimedia.org/T376988 [13:29:24] (03CR) 10Ayounsi: [C:03+1] Add new frack switches for monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/1075624 (https://phabricator.wikimedia.org/T374587) (owner: 10Papaul) [13:31:26] (03PS1) 10Zabe: s7: Reduce revision-slots cache expiry to 60 seconds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079520 (https://phabricator.wikimedia.org/T183490) [13:32:04] (03PS3) 10Arnaudb: mariadb: add data directory accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) [13:32:27] Please ignore the 'ripe-atlas-.*:0 has failed probes' alerts. I'm working on migrating them to Prometheus, but it seems something is not working as expected. Sorry for the noise. [13:33:22] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) [13:33:35] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10221410 (10Papaul) [13:34:02] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) [13:35:14] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [13:35:15] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new firewalls - https://phabricator.wikimedia.org/T374176#10221411 (10Papaul) 05Open→03Resolved This is now complete the new firewall is in place and in production [13:39:24] (03PS1) 10Daimona Eaytoy: [wikidatawiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079521 
(https://phabricator.wikimedia.org/T375411) [13:39:40] (03PS1) 10DCausse: wdqs: better filtering of monitoring queries [puppet] - 10https://gerrit.wikimedia.org/r/1079522 [13:39:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, October 15 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079521 (https://phabricator.wikimedia.org/T375411) (owner: 10Daimona Eaytoy) [13:41:39] (03PS2) 10Daimona Eaytoy: [wikidatawiki] Enable the CampaignEvents extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1079521 (https://phabricator.wikimedia.org/T375411) [13:41:42] (03PS4) 10Arnaudb: mariadb: add data directory accessor [software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) [13:43:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: T376988', diff saved to https://phabricator.wikimedia.org/P69694 and previous config saved to /var/cache/conftool/dbconfig/20241011-134357-arnaudb.json [13:44:09] T376988: db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. 
- https://phabricator.wikimedia.org/T376988 [13:44:13] (03PS7) 10Arturo Borrero Gonzalez: cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) [13:44:50] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [13:46:28] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [13:47:41] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-debug: Recreate instead of RollingUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1079283 (https://phabricator.wikimedia.org/T374907) (owner: 10Alexandros Kosiaris) [13:47:54] (03CR) 10Clément Goubert: [C:03+2] mw-debug-repl: Support next release [puppet] - 10https://gerrit.wikimedia.org/r/1079284 (https://phabricator.wikimedia.org/T376895) (owner: 10Clément Goubert) [13:51:13] (03PS1) 10Btullis: Omit the absented secret [puppet] - 10https://gerrit.wikimedia.org/r/1079524 (https://phabricator.wikimedia.org/T323692) [13:51:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:52:23] (03PS2) 10Btullis: Omit the absented secret [puppet] - 10https://gerrit.wikimedia.org/r/1079524 (https://phabricator.wikimedia.org/T323692) [13:52:42] (03CR) 10Zabe: "Does that mean that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1076995 also would have needed to wait until labtestwik" [puppet] - 10https://gerrit.wikimedia.org/r/1077345 (https://phabricator.wikimedia.org/T371592) (owner: 10Zabe) [13:53:11] (03CR) 10Arnaudb: "Done" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1078616 (https://phabricator.wikimedia.org/T376701) (owner: 10Arnaudb) [13:53:57] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:53:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [13:56:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [13:56:59] (03CR) 10Btullis: [C:03+2] Omit the absented secret [puppet] - 10https://gerrit.wikimedia.org/r/1079524 (https://phabricator.wikimedia.org/T323692) (owner: 10Btullis) [13:57:12] (03PS1) 10Ayounsi: Prefix validator: ensure k8s role and site [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1079525 (https://phabricator.wikimedia.org/T354169) [13:58:56] (03CR) 10CI reject: [V:04-1] Prefix validator: ensure k8s role and site [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1079525 (https://phabricator.wikimedia.org/T354169) (owner: 10Ayounsi) [13:58:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:59:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: T376988', diff saved to https://phabricator.wikimedia.org/P69695 and previous config saved to /var/cache/conftool/dbconfig/20241011-135903-arnaudb.json [13:59:09] T376988: 
db2175 replication breakage with: Error 'Index for table 'recentchanges' is corrupt; try to repair it' on query. Default database: 'nlwiki'. - https://phabricator.wikimedia.org/T376988 [14:01:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:03:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:06:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:07:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:11:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:12:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:13:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:14:09] (03CR) 10Cathal Mooney: [C:03+1] "Yeah agreed, 
we shouldn't send anything smaller than a /48." [homer/public] - 10https://gerrit.wikimedia.org/r/1079487 (owner: 10Ayounsi) [14:16:28] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:21:55] FIRING: SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:25:18] (03CR) 10Cathal Mooney: "There is no aggregate config on the cloudsw for v6 right now, we could do it there however, and then send just the /48 which the existing " [homer/public] - 10https://gerrit.wikimedia.org/r/1079288 (https://phabricator.wikimedia.org/T245495) (owner: 10Cathal Mooney) [14:28:12] (03CR) 10Alexandros Kosiaris: [C:03+1] sre.discovery.datacenter: Add failover_from action [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [14:28:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:30:53] !log andrewtavis-wmde@deploy2002 Started deploy [airflow-dags/wmde@c9a2532]: (no justification provided) [14:31:03] !log andrewtavis-wmde@deploy2002 
Finished deploy [airflow-dags/wmde@c9a2532]: (no justification provided) (duration: 00m 25s) [14:34:27] (03PS3) 10JHathaway: resolvconf: add nameservr_ips [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:34:59] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:36:44] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: codfw refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079239 (https://phabricator.wikimedia.org/T376170) (owner: 10Clément Goubert) [14:37:07] (03PS1) 10DCausse: rdf-streaming-updater: relax latency/unstability alerts [alerts] - 10https://gerrit.wikimedia.org/r/1079530 [14:37:12] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:54] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: codfw expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079240 (https://phabricator.wikimedia.org/T376665) (owner: 10Clément Goubert) [14:38:11] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: eqiad refresh [puppet] - 10https://gerrit.wikimedia.org/r/1079241 (https://phabricator.wikimedia.org/T376185) (owner: 10Clément Goubert) [14:38:14] (03CR) 10Elukey: [C:03+1] Import ceph-csi-cephfs chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077872 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [14:38:22] (03CR) 10Alexandros Kosiaris: [C:03+1] kubernetes: eqiad expansion [puppet] - 10https://gerrit.wikimedia.org/r/1079242 (https://phabricator.wikimedia.org/T376307) (owner: 10Clément Goubert) [14:38:23] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/data-gateway: apply [14:38:49] (03CR) 10Elukey: [C:03+1] Make it 
possible to deploy provisioner without the snahshotter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077873 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [14:38:58] (03CR) 10Elukey: [C:03+1] Run the driver-registrar as root [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077874 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [14:39:02] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [14:39:59] (03CR) 10JHathaway: [C:03+2] resolvconf: add nameservr_ips [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [14:40:41] (03CR) 10Elukey: "Should we set selinuxMount: false in values.yaml? Rather than doing it later on in helmfile etc.." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1077875 (https://phabricator.wikimedia.org/T376406) (owner: 10Brouberol) [14:41:56] (03CR) 10Cathal Mooney: [C:03+1] tlsproxy::envoy: Simplify firewall rule set [puppet] - 10https://gerrit.wikimedia.org/r/1079395 (owner: 10Muehlenhoff) [14:43:41] (03CR) 10Alexandros Kosiaris: mw-script: Add prometheus-statsd-exporter (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) (owner: 10Alexandros Kosiaris) [14:44:58] (03PS5) 10Alexandros Kosiaris: mw-script: Add prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) [14:44:58] (03PS3) 10Alexandros Kosiaris: mw-script: Remove ci_only_release_do_not_deploy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078672 [14:46:10] (03CR) 10Alexandros Kosiaris: mw-script: Remove ci_only_release_do_not_deploy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1078672 (owner: 10Alexandros Kosiaris) [14:46:29] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [14:46:59] !log eevans@deploy2002 helmfile 
[codfw] DONE helmfile.d/services/data-gateway: apply [14:47:08] !log upgrading data-gateway to v1.0.10 [14:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:12] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [14:48:43] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [14:48:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:49:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:51:28] (03PS8) 10Arturo Borrero Gonzalez: cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) [14:51:55] FIRING: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:57] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [14:53:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:54:11] (03PS1) 10Elukey: Add aux-k8s-etcd1004 in service [puppet] - 10https://gerrit.wikimedia.org/r/1079534 (https://phabricator.wikimedia.org/T344230) [14:54:13] (03PS1) 10Elukey: Add aux-k8s-etcd1005 in service [puppet] - 
10https://gerrit.wikimedia.org/r/1079535 (https://phabricator.wikimedia.org/T344230) [14:54:28] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:56:24] (03PS2) 10Volans: sre.switchdc.databases.prepare: add check [cookbooks] - 10https://gerrit.wikimedia.org/r/1074127 (https://phabricator.wikimedia.org/T371351) [14:56:24] (03PS2) 10Volans: sre.switchdc.databases: update Phabricator more [cookbooks] - 10https://gerrit.wikimedia.org/r/1074128 (https://phabricator.wikimedia.org/T371351) [14:56:24] (03PS1) 10Volans: sre.switchdc.databases.prepare: fix heartbeat [cookbooks] - 10https://gerrit.wikimedia.org/r/1079536 (https://phabricator.wikimedia.org/T375144) [14:56:26] (03CR) 10Cathal Mooney: [C:03+2] BGP_Customer_out: set No_smallnet6 to /49 and below [homer/public] - 10https://gerrit.wikimedia.org/r/1079487 (owner: 10Ayounsi) [14:56:26] (03PS1) 10Volans: sre.switchdc.databases: allow to select a section [cookbooks] - 10https://gerrit.wikimedia.org/r/1079537 (https://phabricator.wikimedia.org/T375144) [14:56:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:55] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:57:02] (03Merged) 10jenkins-bot: BGP_Customer_out: set No_smallnet6 to /49 and below [homer/public] - 10https://gerrit.wikimedia.org/r/1079487 (owner: 10Ayounsi) [14:57:49] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: server failure for 
cloudvirt1063.eqiad.wmnet - https://phabricator.wikimedia.org/T375372#10221637 (10fnegri) Any updates on this host? Let me know if I can help with testing. [14:58:48] (03PS1) 10Elukey: Add aux-k8s-etcd1004 to the aux-k8s SRV records [dns] - 10https://gerrit.wikimedia.org/r/1079539 (https://phabricator.wikimedia.org/T344230) [14:58:49] (03PS1) 10Elukey: Add aux-k8s-etcd1005 to the Aux k8s SRV records [dns] - 10https://gerrit.wikimedia.org/r/1079540 (https://phabricator.wikimedia.org/T344230) [14:58:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [14:59:28] RESOLVED: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [14:59:49] (03PS2) 10Volans: sre.switchdc.databases: allow to select a section [cookbooks] - 10https://gerrit.wikimedia.org/r/1079537 (https://phabricator.wikimedia.org/T375144) [15:00:40] (03CR) 10Papaul: [C:03+2] Add new frack switches for monitoring. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1075624 (https://phabricator.wikimedia.org/T374587) (owner: 10Papaul) [15:00:58] 06SRE-OnFire, 06Data-Persistence-SRE, 06DBA, 13Patch-For-Review, 07Sustainability: ROW-based replicas broke with cleaned up heartbeat tables after setting up circular replication - https://phabricator.wikimedia.org/T375144#10221649 (10Volans) @jcrespo I had it almost finished yesterday but then I had to... 
[15:01:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:02:12] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:27] (03CR) 10Scott French: "Thanks for confirming! Yeah, agreed that it's still good to have something like this for use when recovering from an emergency switchover." [cookbooks] - 10https://gerrit.wikimedia.org/r/912813 (https://phabricator.wikimedia.org/T335364) (owner: 10Clément Goubert) [15:02:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:03:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:04:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:06:55] FIRING: [4x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:08:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing 
to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:08:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:09:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:11:11] (03CR) 10Cathal Mooney: [C:03+1] cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [15:11:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:11:47] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: set IPv6 forwarding in all interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1079508 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [15:12:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:13:28] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:13:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:16:07] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: refresh forwarding firewall to accomodate for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1079515 (https://phabricator.wikimedia.org/T374716) [15:16:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:16:55] FIRING: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:10] RESOLVED: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:39] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: refresh forwarding firewall to accomodate for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1079515 (https://phabricator.wikimedia.org/T374716) [15:18:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [15:18:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:19:08] (03PS2) 10Tiziano Fogli: blackbox/icmp: deployment sites controlled by input parameter instead of ::site [puppet] - 
10https://gerrit.wikimedia.org/r/1079531 (https://phabricator.wikimedia.org/T370506) [15:19:19] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079515 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [15:19:49] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1079515 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [15:19:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:21:20] (03CR) 10Pppery: "That's done (with some others) in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1079054/6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: 10Pppery) [15:23:01] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] cloudgw: refresh forwarding firewall to accomodate for IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/1079515 (https://phabricator.wikimedia.org/T374716) (owner: 10Arturo Borrero Gonzalez) [15:23:15] 10ops-codfw, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007 (10RobH) 03NEW [15:23:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:24:28] (03CR) 10Tiziano Fogli: [C:03+2] atlas: adding prometheus blackbox icmp checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1079226 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [15:24:50] 10ops-codfw, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10221735 (10RobH) [15:24:58] FIRING: [3x] 
KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:27:28] (03CR) 10Jon Harald Søby: "Nice! Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1078122 (https://phabricator.wikimedia.org/T249648) (owner: 10Pppery) [15:27:52] (03CR) 10Scott French: "Thanks, Hugh! A couple of mostly questions." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1077682 (https://phabricator.wikimedia.org/T371699) (owner: 10Hnowlan) [15:28:10] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:28:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:29:43] (03CR) 10Tiziano Fogli: alertmanager-irc: improve ErrorBudgetBurn SLO alert text (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1078718 (https://phabricator.wikimedia.org/T376740) (owner: 10Herron) [15:29:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:30:27] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 3 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10221746 (10Papaul) [15:32:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [15:33:25] (03PS1) 10Btullis: Revert cephosd servers from 
nftables to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1079545 (https://phabricator.wikimedia.org/T327259) [15:33:33] 10ops-codfw, 06DC-Ops, 06serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009 (10RobH) 03NEW [15:33:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:34:05] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:34:07] (03CR) 10Brouberol: [C:03+1] Revert cephosd servers from nftables to ferm [puppet] - 10https://gerrit.wikimedia.org/r/1079545 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:34:10] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [15:34:16] 10ops-codfw, 06DC-Ops, 06serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10221782 (10RobH) [15:34:16] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/4292/co" [puppet] - 10https://gerrit.wikimedia.org/r/1079545 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:34:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:35:19] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops, and 2 others: codfw:frack:rack/install/configuration new switches - https://phabricator.wikimedia.org/T374587#10221749 (10Papaul) 05Open→03Resolved This is complete switches are in production [15:35:32] (03CR) 10Btullis: [V:03+1 C:03+2] Revert cephosd servers from nftables to ferm [puppet] - 
10https://gerrit.wikimedia.org/r/1079545 (https://phabricator.wikimedia.org/T327259) (owner: 10Btullis) [15:36:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [15:36:55] RESOLVED: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:37:30] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for codfw cloudgw - cmooney@cumin1002" [15:37:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for codfw cloudgw - cmooney@cumin1002" [15:37:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:40:07] 10ops-codfw, 06DC-Ops, 06serviceops: Q2:rack/setup/install kubestage200[3-4] - https://phabricator.wikimedia.org/T377009#10221838 (10RobH) a:03Clement_Goubert @Clement_Goubert, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team... [15:40:17] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[28-35] - https://phabricator.wikimedia.org/T377007#10221841 (10RobH) a:03Clement_Goubert @Clement_Goubert, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates fr... 
[15:40:29] !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on A:cephosd [15:41:02] (03PS2) 10JHathaway: dynamicproxy: update to pull ips from proile::resolving [puppet] - 10https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:41:04] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install mc-gp200[4-6] - https://phabricator.wikimedia.org/T376968#10221843 (10RobH) a:03Clement_Goubert @Clement_Goubert, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-... [15:41:11] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:41:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudlb2004-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTARTand with Dell SCP reboot policy FORCED [15:41:27] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q2:rack/setup/install wikikube-worker21[56-70] - https://phabricator.wikimedia.org/T376965#10221845 (10RobH) a:03Clement_Goubert @Clement_Goubert, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates fr... 
[15:42:27] (03PS11) 10Brouberol: envoy: Fix firewall_srange not being taken into account [puppet] - 10https://gerrit.wikimedia.org/r/1079542 (https://phabricator.wikimedia.org/T327259) [15:42:57] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1079542 (https://phabricator.wikimedia.org/T327259) (owner: 10Brouberol) [15:43:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:44:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:47:40] (03PS1) 10JHathaway: puppet-lint: don't warn about legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1079547 (https://phabricator.wikimedia.org/T372666) [15:48:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:49:30] (03PS3) 10JHathaway: dynamicproxy: update to pull ips from proile::resolving [puppet] - 10https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:49:33] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:49:50] (03CR) 10JHathaway: [C:03+2] puppet-lint: don't warn about legacy facts [puppet] - 10https://gerrit.wikimedia.org/r/1079547 (https://phabricator.wikimedia.org/T372666) (owner: 10JHathaway) [15:50:22] (03CR) 10RLazarus: [C:03+1] "Thanks for this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1078666 (https://phabricator.wikimedia.org/T376714) (owner: 10Alexandros Kosiaris) [15:51:55] FIRING: [2x] SystemdUnitFailed: rsyslog-imfile-remedy.service on kubestagemaster2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:54:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [15:58:08] (03PS2) 10JHathaway: acme_chief::cloud: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:58:20] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:58:41] (03CR) 10CI reject: [V:04-1] acme_chief::cloud: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:02:19] (03CR) 10JHathaway: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:07:47] (03PS1) 10JHathaway: Revert "puppet-lint: don't warn about legacy facts" [puppet] - 10https://gerrit.wikimedia.org/r/1079549 [16:08:27] (03CR) 10JHathaway: [C:03+2] Revert "puppet-lint: don't warn about legacy facts" [puppet] - 10https://gerrit.wikimedia.org/r/1079549 (owner: 10JHathaway) [16:09:17] (03PS3) 10JHathaway: acme_chief::cloud: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:09:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - 
TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [16:10:30] !log kcvelaga@deploy2002 Started deploy [airflow-dags/analytics_product@1fb69c4]: T376456 [16:10:46] T376456: ETL pipeline to calculate potential vandalism handed by Automoderator - https://phabricator.wikimedia.org/T376456 [16:10:55] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:11:34] !log kcvelaga@deploy2002 Finished deploy [airflow-dags/analytics_product@1fb69c4]: T376456 (duration: 01m 15s) [16:11:55] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on kubestagemaster2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:12:39] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [16:14:11] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [16:14:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudlb2004-dev to codfw - jhancock@cumin2002" [16:14:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:21] (03CR) 10Cwhite: [C:03+1] blackbox/icmp: deployment sites controlled by input parameter instead of ::site [puppet] - 10https://gerrit.wikimedia.org/r/1079531 (https://phabricator.wikimedia.org/T370506) (owner: 10Tiziano Fogli) [16:15:21] (03CR) 10Xcollazo: [C:03+1] "Changes to the following files LGTM:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: 10Kimberly Sarabia) [16:16:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudlb2004-dev.codfw.wmnet with OS bookworm [16:16:42] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10222008 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm [16:18:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:19:58] RESOLVED: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [16:20:15] (CR) Ottomata: [C:+1] Remove legacy UI actions tracking [mediawiki-config] - https://gerrit.wikimedia.org/r/1077504 (https://phabricator.wikimedia.org/T376065) (owner: Kimberly Sarabia) [16:20:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:22:45] ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021 (RobH) NEW [16:24:07] ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10222049 (RobH) [16:25:42] ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker12[35-42] - https://phabricator.wikimedia.org/T377021#10222051 (RobH) a:Clement_Goubert @Clement_Goubert, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the su... 
[16:26:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [16:30:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:31:00] (PS2) JHathaway: toolforge::docker::registry: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971413 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:31:04] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971413 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:34:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb2004-dev.codfw.wmnet with reason: host reimage [16:34:43] (CR) Herron: alertmanager-irc: improve ErrorBudgetBurn SLO alert text (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1078718 (https://phabricator.wikimedia.org/T376740) (owner: Herron) [16:35:15] (Abandoned) JHathaway: toolforge::legacy_redirector: don't use the nameservers global [puppet] - https://gerrit.wikimedia.org/r/971414 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:36:49] (PS2) JHathaway: toolforge::static: don't use the nameservers global [puppet] - https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:36:52] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:36:55] FIRING: SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:36:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [16:37:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb2004-dev.codfw.wmnet with reason: host reimage [16:37:46] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@c1d2914]: bump section topics to v0.16.0 [16:38:27] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@c1d2914]: bump section topics to v0.16.0 (duration: 01m 06s) [16:39:03] (CR) Nettrom: [C:+1] "Looks good to me!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1079475 (https://phabricator.wikimedia.org/T376833) (owner: Cyndywikime) [16:39:13] (Abandoned) JHathaway: labs::ores::redisproxy: drop unused role [puppet] - https://gerrit.wikimedia.org/r/971417 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:39:34] (CR) Ottomata: [C:+1] "+1" [puppet] - https://gerrit.wikimedia.org/r/1074414 (https://phabricator.wikimedia.org/T370668) (owner: Fabfur) [16:39:49] !log mfossati@deploy2002 Started deploy [airflow-dags/platform_eng@c1d2914]: bump section topics to v0.16.0 [16:40:21] !log mfossati@deploy2002 Finished deploy [airflow-dags/platform_eng@c1d2914]: bump section topics to v0.16.0 (duration: 00m 42s) [16:41:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:41:43] (PS2) JHathaway: scap::target: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971419 
(https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:41:49] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971419 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:41:58] RESOLVED: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [16:43:03] (CR) Majavah: [C:+2] toolforge::docker::registry: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971413 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:43:10] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:43:33] (PS2) JHathaway: wikilabels::db_proxy: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971422 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:43:35] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971422 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:43:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [16:43:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [16:46:17] (PS3) JHathaway: realm.pp: drop namservers global as it is no longer used [puppet] - https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:48:10] FIRING: [2x] 
SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:48:55] (CR) JHathaway: realm.pp: drop namservers global as it is no longer used (1 comment) [puppet] - https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:49:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on A:cephosd [16:54:34] (CR) JHathaway: [C:+2] wikilabels::db_proxy: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971422 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [16:54:44] ops-eqiad, SRE, DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T376235#10222163 (phaultfinder) [16:55:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:56:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:57:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:57:06] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb2004-dev.codfw.wmnet with OS bookworm [16:57:11] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - 
https://phabricator.wikimedia.org/T370678#10222171 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host cloudlb2004-dev.codfw.wmnet with OS bookworm complete... [16:58:13] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027 (RobH) NEW [16:58:52] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10222192 (RobH) [16:58:57] RESOLVED: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:00:28] ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install wikikube-worker21[36-55] - https://phabricator.wikimedia.org/T377027#10222205 (RobH) a:Clement_Goubert @Clement_Goubert, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the su... [17:01:26] RESOLVED: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:07:08] (CR) Ssingh: [C:+1] acme_chief::cloud: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:09:29] (CR) Ssingh: "Looks good but out of curiosity, any reason to not look up profile::resolving::nameservers directly and then use that?" 
[puppet] - https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:09:32] (CR) Ssingh: [C:+1] toolforge::static: don't use the nameservers global [puppet] - https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:11:14] (CR) Ssingh: [C:+1] "[Same question here and trying to figure this out in general: is there any reason to prefer one over the other? include profile::resolving" [puppet] - https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:11:30] FIRING: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:12:01] (CR) Ssingh: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:12:24] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10222232 (Jhancock.wm) Open→Resolved @aborrero this is finally ready. 
turned into a learning opportunity [17:12:25] ops-codfw, SRE, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudlb2004-dev - https://phabricator.wikimedia.org/T370678#10222236 (Jhancock.wm) [17:13:47] (CR) Ssingh: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:13:52] argh [17:16:30] RESOLVED: ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1018:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:18:17] (CR) Ssingh: [C:+1] scap::target: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971419 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:26:32] (CR) Ssingh: [C:+1] "Looks good https://puppet-compiler.wmflabs.org/output/971423/4294/. I would say let's merge this on Monday to be extra sure." [puppet] - https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [17:46:18] ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032 (RobH) NEW [17:46:39] ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10222328 (RobH) [17:48:19] ops-eqiad, DC-Ops, serviceops: Q2:rack/setup/install mc-gp100[4-6] - https://phabricator.wikimedia.org/T377032#10222345 (RobH) a:Clement_Goubert @Clement_Goubert, Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team rece... 
[17:55:02] SRE-OnFire, Incident Tooling: Corto: Functional & Integration testing - https://phabricator.wikimedia.org/T377036 (Eevans) NEW [18:00:00] SRE-OnFire, Incident Tooling: corto: implement resolve incident - https://phabricator.wikimedia.org/T370783#10222396 (Eevans) Open→Resolved a:Eevans Complete (sans tests): https://gitlab.wikimedia.org/repos/sre/corto/-/commit/823578bc2809c92393aad126b604fc5f2f2b301f [18:01:52] (CR) JHathaway: [C:+2] dynamicproxy: update to pull ips from proile::resolving [puppet] - https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [18:02:00] (CR) JHathaway: [C:+2] acme_chief::cloud: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [18:02:09] (CR) JHathaway: [C:+2] toolforge::static: don't use the nameservers global [puppet] - https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [18:02:21] (CR) JHathaway: [C:+2] scap::target: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971419 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [18:03:26] (PS4) JHathaway: realm.pp: drop namservers global as it is no longer used [puppet] - https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [18:03:37] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [18:05:09] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:11:55] RESOLVED: 
SystemdUnitFailed: rsyslog-imfile-remedy.service on kubestagemaster2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:20:09] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:27:45] FIRING: [2x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2004:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:29:02] Puppet, SRE, Infrastructure-Foundations: Facter is slow on a few hosts - https://phabricator.wikimedia.org/T251293#10222459 (colewhite) Open→Resolved Haven't seen widespread problems with export_smart_data_dump.service [[ https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasou... 
[18:30:43] SRE, Observability-Metrics, Goal, Patch-Needs-Improvement: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870#10222467 (colewhite) [18:30:57] FIRING: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:46:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:49:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [18:50:09] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:50:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [18:51:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [18:55:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:01:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [19:29:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [19:30:02] (CR) JHathaway: [C:+2] "I think they both have pros and cons:" [puppet] - https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: Jbond) [19:30:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:31:28] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [19:34:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [19:35:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:36:55] FIRING: SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:58] FIRING: [3x] 
KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [19:48:10] FIRING: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:52:58] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [19:56:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:00:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:01:28] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:01:55] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api 
- https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:05:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:06:43] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:08:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:13:24] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:10] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:21:55] RESOLVED: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:29:57] RESOLVED: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:30:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:33:57] FIRING: CalicoKubeControllersDown: Calico Kubernetes Controllers not running - https://wikitech.wikimedia.org/wiki/Calico#Kube_Controllers" - TODO - https://alerts.wikimedia.org/?q=alertname%3DCalicoKubeControllersDown [20:39:09] FIRING: [7x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:48:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [20:49:09] RESOLVED: [10x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:54:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:59:43] ops-eqiad, SRE, DC-Ops: PDU 
sensor over limit - https://phabricator.wikimedia.org/T376235#10222755 (phaultfinder) [21:01:26] FIRING: [8x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:02:45] FIRING: [8x] ProbeDown: Service install2004:8080 has failed probes (http_squid_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:03:58] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:04:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:05:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:06:55] FIRING: SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:09:43] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:10:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - 
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:10:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:12:24] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:14:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host cephosd1001.eqiad.wmnet [21:14:43] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:16:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:20:57] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:22:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:24:19] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1001.eqiad.wmnet [21:25:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:26:04] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host cephosd1002.eqiad.wmnet [21:26:55] FIRING: SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:27:28] RESOLVED: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:28:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:29:21] (PS1) Xcollazo: Fix security checksum for web_request's refinery-drop-older-than [puppet] - https://gerrit.wikimedia.org/r/1079573 (https://phabricator.wikimedia.org/T376882) [21:30:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:32:43] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:36:09] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1002.eqiad.wmnet [21:38:52] !log btullis@cumin1002 START - Cookbook sre.ceph.roll-restart-reboot-server rolling reboot on P{cephosd100[3-5]*} and (A:cephosd) [21:39:39] (CR) Btullis: [C:+2] Fix security checksum for web_request's refinery-drop-older-than [puppet] - https://gerrit.wikimedia.org/r/1079573
(https://phabricator.wikimedia.org/T376882) (owner: Xcollazo) [21:43:24] SRE-OnFire, Incident Tooling: Functional tests are tied to user accounts - https://phabricator.wikimedia.org/T377047 (Eevans) NEW [21:52:43] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:53:28] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [21:56:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:59:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:00:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:01:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:02:43] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:06:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has
failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:07:43] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:10:57] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:16:09] FIRING: [2x] KubernetesAPILatency: High Kubernetes API latency (GET leases) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-staging&var-latency_percentile=0.95&var-verb=GET - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:18:42] !log btullis@cumin1002 END (PASS) - Cookbook sre.ceph.roll-restart-reboot-server (exit_code=0) rolling reboot on P{cephosd100[3-5]*} and (A:cephosd) [22:21:09] RESOLVED: [9x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:21:39] FIRING: [5x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:24:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:26:39] RESOLVED: [9x] KubernetesAPILatency: High Kubernetes API latency (GET ) on k8s-staging@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:27:43] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:30:57] RESOLVED: [2x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:31:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:31:55] FIRING: SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:32:27] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:34:19] SRE, Charts, Patch-For-Review, Service-deployment-requests: New Service Request: chart-renderer - https://phabricator.wikimedia.org/T376939#10222908 (Jdlrobson) p:Triage→High [22:36:28] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:37:27] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:41:28] FIRING: [3x]
KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:43:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:48:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:51:26] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:51:28] RESOLVED: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:51:55] RESOLVED: SystemdUnitFailed: wmf_auto_restart_ssh.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:52:27] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:52:45] FIRING: [6x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:56:55] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:57:27] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [22:57:28] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [22:58:10] RESOLVED: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:01:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:01:43] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:03:10] FIRING: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:04:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:07:27] FIRING: [2x] KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:11:55] RESOLVED: [2x] SystemdUnitFailed: kube-controller-manager.service on kubestagemaster2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:43] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:16:55] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:10] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:19:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:21:31] SRE-OnFire, Incident Tooling: corto: implement changing the IC - https://phabricator.wikimedia.org/T370784#10222999 (Eevans) Open→Resolved
a:Eevans Done (see: [[ https://gitlab.wikimedia.org/repos/sre/corto/-/commit/20d029f7f48fa6259c93dcca76aba679283c3a08 | 20d029f ]]) [23:23:10] FIRING: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:23:13] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:24:43] FIRING: KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:24:58] FIRING: [2x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:26:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:29:43] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:29:58] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:32:27] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod -
https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:36:55] FIRING: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:43] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1079580 [23:38:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1079580 (owner: TrainBranchBot) [23:39:43] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:41:55] FIRING: [2x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:42:27] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:44:43] FIRING: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [23:46:55] FIRING: [3x] SystemdUnitFailed: imagecatalog_record.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:47:27]
FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:57:27] FIRING: [3x] KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:59:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:59:58] RESOLVED: [3x] KubernetesAPINotScrapable: k8s-staging@codfw is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable