[00:06:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[00:31:36] <jinxer-wm>	 (GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[00:36:36] <jinxer-wm>	 (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[00:38:28] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/939271
[00:38:34] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/939271 (owner: TrainBranchBot)
[00:53:35] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/939271 (owner: TrainBranchBot)
[01:14:52] <wikibugs>	 (CR) Gergő Tisza: "Do you want to update the edit summary, now that the patch does a bunch of changes not really related to cswiki? And maybe mention the DB " [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: Urbanecm)
[01:27:54] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:30:50] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:53:36] <jinxer-wm>	 (GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[01:56:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:58:14] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (phaultfinder)
[01:58:36] <jinxer-wm>	 (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors
[02:00:31] <wikibugs>	 SRE, ops-knams, DC-Ops: Relocate one of the mx480 from knams to esams - https://phabricator.wikimedia.org/T342198 (Papaul)
[02:01:03] <wikibugs>	 SRE, ops-knams, DC-Ops: Relocate one of the mx480 from knams to esams - https://phabricator.wikimedia.org/T342198 (Papaul) p:Triage→Medium a:wiki_willy→Papaul
[02:01:15] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (phaultfinder)
[02:03:15] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (phaultfinder)
[02:06:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:16] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (phaultfinder)
[02:09:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:13:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:14:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:17:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:24:40] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:28:38] <icinga-wm>	 PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:33:44] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:57:55] <wikibugs>	 (PS1) Sohom Datta: Enable EditInSequence in pawikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/939392 (https://phabricator.wikimedia.org/T341786)
[03:23:37] <wikibugs>	 (CR) Tim Starling: IP Masking: Enable for cswiki beta (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: Urbanecm)
[03:31:06] <icinga-wm>	 PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100%
[04:17:45] <wikibugs>	 (PS1) David Martin: Create puppet scripting for sqooping Wikifunctions tables [puppet] - https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199)
[04:34:59] <wikibugs>	 (PS1) Marostegui: Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/939337
[04:37:17] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1198: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/939337 (owner: Marostegui)
[04:37:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49574 and previous config saved to /var/cache/conftool/dbconfig/20230719-043740-root.json
[04:37:58] <wikibugs>	 SRE, ops-eqiad, DBA: db1198 crashed - https://phabricator.wikimedia.org/T342129 (Marostegui) Host being repooled.
[04:52:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49575 and previous config saved to /var/cache/conftool/dbconfig/20230719-045245-root.json
[05:07:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49576 and previous config saved to /var/cache/conftool/dbconfig/20230719-050750-root.json
[05:22:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49577 and previous config saved to /var/cache/conftool/dbconfig/20230719-052254-root.json
[05:27:51] <wikibugs>	 (PS4) Abijeet Patro: Add channel for TtmServerMessageUpdate of Translate extension [mediawiki-config] - https://gerrit.wikimedia.org/r/927701
[05:38:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49578 and previous config saved to /var/cache/conftool/dbconfig/20230719-053759-root.json
[05:53:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49579 and previous config saved to /var/cache/conftool/dbconfig/20230719-055304-root.json
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T0600)
[06:08:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49580 and previous config saved to /var/cache/conftool/dbconfig/20230719-060809-root.json
[06:08:33] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (phaultfinder)
[06:14:23] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:23:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49581 and previous config saved to /var/cache/conftool/dbconfig/20230719-062313-root.json
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T0700). Please do the needful.
[07:00:05] <jouncebot>	 dcausse and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:38] <dcausse>	 o/
[07:02:23] <abijeet>	 o/
[07:09:34] <dcausse>	 I suppose I can deploy unless there are objections
[07:12:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2158', diff saved to https://phabricator.wikimedia.org/P49582 and previous config saved to /var/cache/conftool/dbconfig/20230719-071204-root.json
[07:12:45] <wikibugs>	 (PS10) JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826)
[07:12:47] <wikibugs>	 (PS4) JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826)
[07:12:55] <wikibugs>	 (CR) JMeybohm: kubernetes::master: Publish service-account cert to etcd (1 comment) [puppet] - https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[07:15:08] <wikibugs>	 (PS1) Marostegui: db2158: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/939622 (https://phabricator.wikimedia.org/T334650)
[07:15:10] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42546/console" [puppet] - https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[07:15:42] <wikibugs>	 (CR) Marostegui: [C: +2] db2158: Migrate to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/939622 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui)
[07:15:45] <dcausse>	 abijeet: deploying your config change
[07:17:24] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/927701 (owner: Abijeet Patro)
[07:17:45] <abijeet>	 dcausse, ok, thanks!
[07:17:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49583 and previous config saved to /var/cache/conftool/dbconfig/20230719-071755-root.json
[07:18:03] <wikibugs>	 (Merged) jenkins-bot: Add channel for TtmServerMessageUpdate of Translate extension [mediawiki-config] - https://gerrit.wikimedia.org/r/927701 (owner: Abijeet Patro)
[07:18:49] <logmsgbot>	 !log dcausse@deploy1002 Started scap: Backport for [[gerrit:927701|Add channel for TtmServerMessageUpdate of Translate extension]]
[07:18:52] <dcausse>	 abijeet: I suppose this is affecting code in ttm update jobs and thus can't be tested on mw-debug servers?
[07:19:10] <wikibugs>	 (CR) JMeybohm: [C: +1] confd: allow running multiple instances [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[07:20:26] <logmsgbot>	 !log dcausse@deploy1002 dcausse and abi: Backport for [[gerrit:927701|Add channel for TtmServerMessageUpdate of Translate extension]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:22:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1180', diff saved to https://phabricator.wikimedia.org/P49584 and previous config saved to /var/cache/conftool/dbconfig/20230719-072207-root.json
[07:22:35] <dcausse>	 abijeet: it's live on mw-debug please let me know if you want me to proceed
[07:23:00] <wikibugs>	 (PS1) Marostegui: db1180: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/939623 (https://phabricator.wikimedia.org/T334650)
[07:23:54] <wikibugs>	 (CR) Marostegui: [C: +2] db1180: Migrate to mariadb 10.6 [puppet] - https://gerrit.wikimedia.org/r/939623 (https://phabricator.wikimedia.org/T334650) (owner: Marostegui)
[07:24:16] <abijeet>	 dcausse, I think we can proceed
[07:24:22] <dcausse>	 sure
[07:26:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49585 and previous config saved to /var/cache/conftool/dbconfig/20230719-072632-root.json
[07:30:12] <dcausse>	 saw "connect to host parse1002.eqiad.wmnet port 22: Connection timed out" during sync-apaches, is this something we should be worried about?
[07:31:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:33:00] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49586 and previous config saved to /var/cache/conftool/dbconfig/20230719-073300-root.json
[07:36:34] <logmsgbot>	 !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:927701|Add channel for TtmServerMessageUpdate of Translate extension]] (duration: 17m 44s)
[07:36:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:39:31] <dcausse>	 abijeet: deploy done but I got warnings on the server parse1002.eqiad.wmnet, I believe this is unrelated to your change
[07:40:18] <abijeet>	 dcausse, yea, i think so too. 
[07:41:17] <abijeet>	 dcausse, like you said the change just enables a log for ttm update jobs
[07:41:26] <abijeet>	 log channel*
[07:41:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49587 and previous config saved to /var/cache/conftool/dbconfig/20230719-074137-root.json
[07:44:18] <_joe_>	 are deployments still ongoing?
[07:44:56] <dcausse>	 _joe_: scap backport ended
[07:45:11] <dcausse>	 I have two patches to deploy but haven't started them yet
[07:45:19] <_joe_>	 ok then gimmie a sec
[07:45:41] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=inactive; selector: name=parse1002.eqiad.wmnet
[07:45:50] <_joe_>	 dcausse: please proceed
[07:45:53] <dcausse>	 _joe_: thanks!
[07:46:38] <logmsgbot>	 !log dcausse@deploy1002 Backport cancelled.
[07:47:06] <_joe_>	 !log powercycling parse1002, console blank, unreachable to network
[07:47:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:25] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by dcausse@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/939328 (owner: DCausse)
[07:48:05] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49588 and previous config saved to /var/cache/conftool/dbconfig/20230719-074804-root.json
[07:49:41] <icinga-wm>	 RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[07:51:26] <dcausse>	 jouncebot: next
[07:51:26] <jouncebot>	 In 2 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1000)
[07:51:52] <_joe_>	 dcausse: lmk when scap backport finished
[07:52:00] <dcausse>	 sure
[07:52:12] <_joe_>	 I'll repool parse1002
[07:54:15] <dcausse>	 might take time tho, waiting for CI on an extension
[07:54:23] <_joe_>	 ah ok
[07:54:27] <_joe_>	 then let me repool now
[07:54:52] <_joe_>	 !log ran scap pull, pool on parse1002 after powercycling
[07:54:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:11] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on parse1002 is CRITICAL: Host parse1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[07:56:18] <_joe_>	 shush
[07:56:21] <_joe_>	 it's actually fixed
[07:56:30] <_joe_>	 stupid icinga
[07:56:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49589 and previous config saved to /var/cache/conftool/dbconfig/20230719-075642-root.json
[08:01:04] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] "While I would like this to be more DRY, it's ok as a first addition." [puppet] - https://gerrit.wikimedia.org/r/928159 (https://phabricator.wikimedia.org/T323192) (owner: Abijeet Patro)
[08:02:31] <wikibugs>	 (Merged) jenkins-bot: Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate [extensions/CirrusSearch] (wmf/1.41.0-wmf.18) - https://gerrit.wikimedia.org/r/939328 (owner: DCausse)
[08:02:59] <logmsgbot>	 !log dcausse@deploy1002 Started scap: Backport for [[gerrit:939328|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]]
[08:03:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49590 and previous config saved to /var/cache/conftool/dbconfig/20230719-080309-root.json
[08:04:29] <logmsgbot>	 !log dcausse@deploy1002 dcausse: Backport for [[gerrit:939328|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:10:35] <logmsgbot>	 !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:939328|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] (duration: 07m 36s)
[08:11:01] <dcausse>	 _joe_: all good with the last deploy, thanks for the quick fix!
[08:11:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49591 and previous config saved to /var/cache/conftool/dbconfig/20230719-081146-root.json
[08:11:48] <dcausse>	 going to extend the deploy window for another patch unless someone has objections
[08:12:54] <wikibugs>	 (PS4) Jbond: monitoring: fix bashisms and other minor lint issues [puppet] - https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064)
[08:13:11] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by dcausse@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/939327 (owner: DCausse)
[08:13:15] <wikibugs>	 (CR) Jbond: "done thanks" [puppet] - https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) (owner: Jbond)
[08:13:42] <wikibugs>	 SRE, Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (TheDJ) This problem was also pretty visible on the wikimediastatus.net graph, I just noticed.  {F37143438}
[08:13:47] <wikibugs>	 SRE, Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (TheDJ) a:TheDJ→cmooney
[08:15:25] <wikibugs>	 (PS5) Jbond: install_server: drop Bashisms [puppet] - https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064)
[08:16:19] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawki::maintenance::translationnotifications: fix calendar defintions [puppet] - https://gerrit.wikimedia.org/r/939629
[08:16:34] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawki::maintenance::translationnotifications: fix calendar defintions [puppet] - https://gerrit.wikimedia.org/r/939629 (owner: Giuseppe Lavagetto)
[08:17:34] <wikibugs>	 (CR) Abijeet Patro: [C: +1] mediawki::maintenance::translationnotifications: fix calendar defintions [puppet] - https://gerrit.wikimedia.org/r/939629 (owner: Giuseppe Lavagetto)
[08:18:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49592 and previous config saved to /var/cache/conftool/dbconfig/20230719-081814-root.json
[08:20:52] <wikibugs>	 (PS11) JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826)
[08:20:54] <wikibugs>	 (PS5) JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826)
[08:20:56] <wikibugs>	 (PS11) JMeybohm: confd: allow running multiple instances [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[08:20:58] <wikibugs>	 (PS1) JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826)
[08:22:06] <wikibugs>	 (PS1) Giuseppe Lavagetto: translationnotifications: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/939631
[08:22:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] translationnotifications: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/939631 (owner: Giuseppe Lavagetto)
[08:22:33] <wikibugs>	 (PS2) JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826)
[08:24:45] <wikibugs>	 (CR) CI reject: [V: -1] kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:26:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49593 and previous config saved to /var/cache/conftool/dbconfig/20230719-082651-root.json
[08:27:25] <wikibugs>	 (CR) CI reject: [V: -1] kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:28:48] <wikibugs>	 (Merged) jenkins-bot: Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate [extensions/CirrusSearch] (wmf/1.41.0-wmf.17) - https://gerrit.wikimedia.org/r/939327 (owner: DCausse)
[08:29:14] <logmsgbot>	 !log dcausse@deploy1002 Started scap: Backport for [[gerrit:939327|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]]
[08:30:03] <wikibugs>	 (PS12) JMeybohm: confd: allow running multiple instances [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[08:30:05] <wikibugs>	 (PS3) JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826)
[08:30:12] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] service::catalog: add prometheus-https [puppet] - https://gerrit.wikimedia.org/r/939326 (https://phabricator.wikimedia.org/T301944) (owner: Herron)
[08:30:47] <logmsgbot>	 !log dcausse@deploy1002 dcausse: Backport for [[gerrit:939327|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[08:32:07] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42548/console" [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[08:32:52] <wikibugs>	 SRE, ops-eqiad, Data-Platform-SRE, Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (BTullis) >>! In T342141#9026115, @Papaul wrote: > @BTullis we had the same issue with sessionstore2001 in codw see task below what we...
[08:33:05] <wikibugs>	 (PS4) JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826)
[08:33:14] <wikibugs>	 SRE-swift-storage, Observability-Metrics, SRE Observability (FY2023/2024-Q1), User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (fgiunchedi) p:Medium→High Sure why not, {{done}}
[08:33:19] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49594 and previous config saved to /var/cache/conftool/dbconfig/20230719-083319-root.json
[08:33:22] <wikibugs>	 (CR) CI reject: [V: -1] kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:35:56] <wikibugs>	 (CR) CI reject: [V: -1] kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[08:36:33] <wikibugs>	 (PS1) Elukey: ml-services: update Docker image for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/939633 (https://phabricator.wikimedia.org/T341479)
[08:37:01] <wikibugs>	 (PS13) JMeybohm: confd: allow running multiple instances [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[08:37:03] <wikibugs>	 (PS5) JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826)
[08:37:13] <logmsgbot>	 !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:939327|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] (duration: 07m 59s)
[08:38:12] <dcausse>	 !log closing the UTC morning backport window
[08:38:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:30] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: update Docker image for ores-legacy [deployment-charts] - https://gerrit.wikimedia.org/r/939633 (https://phabricator.wikimedia.org/T341479) (owner: Elukey)
[08:39:34] <wikibugs>	 (PS11) Jbond: ssh: switch to using the same file we use in production [puppet] - https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947)
[08:40:04] <wikibugs>	 (CR) Jbond: ssh: switch to using the same file we use in production (1 comment) [puppet] - https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: Jbond)
[08:41:06] <wikibugs>	 (PS14) JMeybohm: confd: allow running multiple instances [puppet] - https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: Giuseppe Lavagetto)
[08:41:08] <wikibugs>	 (PS6) JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826)
[08:41:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49595 and previous config saved to /var/cache/conftool/dbconfig/20230719-084156-root.json
[08:42:16] <wikibugs>	 10SRE, 10Traffic, 10Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (10Vgutierrez) >>! In T318804#8639175, @BCornwall wrote: > Looking into it further, it seems this is a very possible change! nginx mappings/site names support wildcard...
[08:42:52] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42552/console" [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[08:44:11] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42553/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto)
[08:45:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[08:45:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334) (owner: 10Jelto)
[08:46:07] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] kubernetes::master: Add confd config writing all sa certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[08:47:46] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "I've fixed two issues that were uncovered when writing the followup patch. Please double check when you have a minute." [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto)
[08:48:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49596 and previous config saved to /var/cache/conftool/dbconfig/20230719-084823-root.json
[08:49:47] <wikibugs>	 (03PS16) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033)
[08:51:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[08:51:36] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall)
[08:51:55] <wikibugs>	 (03Merged) 10jenkins-bot: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[08:54:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall)
[08:56:39] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on parse1002 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[08:57:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49597 and previous config saved to /var/cache/conftool/dbconfig/20230719-085700-root.json
[09:03:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49598 and previous config saved to /var/cache/conftool/dbconfig/20230719-090328-root.json
[09:12:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49599 and previous config saved to /var/cache/conftool/dbconfig/20230719-091205-root.json
[09:14:08] <logmsgbot>	 !log btullis@deploy1002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided)
[09:14:12] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 04s)
[09:20:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:22:24] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Vgutierrez) @RobH I'm seeing on cumin1001 logs, that you interrupted the reimage of lvs1013 by pressing Ctrl+C: ` 2023-07-18 16:01:28,549 robh 2034852 [INFO] Completed command '/usr/local/sbin...
[09:25:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:32:35] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10jbond) >>! In T342130#9024276, @bking wrote: > Was thinking a bit more about this...would it work to do some minimal sanit...
[09:32:54] <wikibugs>	 (03PS1) 10JMeybohm: deployment_server::global_config: Use symlinks for cluster aliases [puppet] - 10https://gerrit.wikimedia.org/r/939636 (https://phabricator.wikimedia.org/T300033)
[09:33:57] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637
[09:35:11] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42554/console" [puppet] - 10https://gerrit.wikimedia.org/r/939636 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[09:36:18] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 (owner: 10Ilias Sarantopoulos)
[09:37:04] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 (owner: 10Ilias Sarantopoulos)
[09:38:27] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 (owner: 10Ilias Sarantopoulos)
[09:38:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: use FQDN in metric [puppet] - 10https://gerrit.wikimedia.org/r/939362 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[09:38:50] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server::global_config: Use symlinks for cluster aliases [puppet] - 10https://gerrit.wikimedia.org/r/939636 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[09:39:13] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 (owner: 10Ilias Sarantopoulos)
[09:39:24] <jbond>	 jayme: happy for me to merge your cr?
[09:39:32] <jayme>	 jbond: yes please
[09:39:46] <jbond>	 done
[09:39:51] <jayme>	 thanks
[09:43:28] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:47:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:48:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[09:50:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[09:52:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[09:54:15] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[09:54:23] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[09:55:53] <wikibugs>	 (03PS1) 10Elukey: ml-services: bump Docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/939640 (https://phabricator.wikimedia.org/T341479)
[09:58:27] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: bump Docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/939640 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey)
[09:58:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: bump Docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/939640 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1000)
[10:02:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:04:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): spicerack: update spicerack to work with the newer puppet infrastructure - https://phabricator.wikimedia.org/T341496 (10jbond)
[10:06:16] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder)
[10:11:07] <wikibugs>	 (03PS1) 10Jbond: puppetserver: do not notify puppetserver service on changes [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490)
[10:13:27] <wikibugs>	 (03PS2) 10Jbond: puppetserver: do not notify puppetserver service on changes [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490)
[10:13:34] <wikibugs>	 (03PS1) 10Btullis: Failover hive services to standby server [dns] - 10https://gerrit.wikimedia.org/r/939644 (https://phabricator.wikimedia.org/T329716)
[10:13:53] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder)
[10:15:45] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Failover hive services to standby server [dns] - 10https://gerrit.wikimedia.org/r/939644 (https://phabricator.wikimedia.org/T329716) (owner: 10Btullis)
[10:16:18] <wikibugs>	 (03PS3) 10Jbond: puppetserver: do not notify puppetserver service on changes [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490)
[10:17:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[10:18:24] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:19:38] <wikibugs>	 (03PS4) 10Jbond: puppetserver: do not notify puppetserver service on changes [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490)
[10:22:57] <wikibugs>	 (03CR) 10Klausman: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman)
[10:23:01] <wikibugs>	 (03PS4) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[10:23:59] <wikibugs>	 (03PS5) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[10:26:15] <wikibugs>	 (03PS6) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034)
[10:26:52] <wikibugs>	 (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[10:27:50] <wikibugs>	 (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[10:29:20] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond)
[10:29:31] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) 05Open→03In progress p:05Triage→03Medium
[10:29:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[10:30:42] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[10:30:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetserver monitoring - https://phabricator.wikimedia.org/T342125 (10jbond) 05In progress→03Stalled this is now stalled until we move the old puppetmasters to the new puppetdb instance
[10:43:22] <wikibugs>	 (03PS1) 10Gmodena: data-engineering: flink: alert based on active site [alerts] - 10https://gerrit.wikimedia.org/r/939651
[10:44:10] <wikibugs>	 (03PS1) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125)
[10:44:33] <wikibugs>	 (03PS3) 10EoghanGaffney: Remove references to releases1002/releases2002 for decom [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435)
[10:47:02] <wikibugs>	 (03PS1) 10Btullis: Install MariaDB to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939653 (https://phabricator.wikimedia.org/T334055)
[10:47:04] <wikibugs>	 (03PS1) 10Btullis: Switch references from db1108 to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055)
[10:48:25] <wikibugs>	 (03PS2) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125)
[10:48:27] <wikibugs>	 (03PS1) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214)
[10:49:21] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42556/console" [puppet] - 10https://gerrit.wikimedia.org/r/939653 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis)
[10:49:33] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42557/console" [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[10:50:40] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42558/console" [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[10:53:09] <wikibugs>	 (03PS1) 10Milimetric: rest-gateway: add route for metrics/knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/939656 (https://phabricator.wikimedia.org/T342213)
[10:55:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:55:38] <wikibugs>	 (03PS2) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214)
[10:58:51] <wikibugs>	 (03CR) 10Marostegui: [C: 04-1] "db1208 doesn't have data yet. It first needs to be recloned." [puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis)
[10:59:07] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@eaabff2]: Regular analytics weekly train [analytics/refinery@eaabff2]
[11:00:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:03:24] <wikibugs>	 (03PS3) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125)
[11:03:26] <wikibugs>	 (03PS3) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214)
[11:04:24] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42560/console" [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[11:07:01] <wikibugs>	 (03PS4) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125)
[11:07:03] <wikibugs>	 (03PS4) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214)
[11:08:09] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42561/console" [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[11:09:32] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@eaabff2]: Regular analytics weekly train [analytics/refinery@eaabff2] (duration: 10m 24s)
[11:11:35] <wikibugs>	 (03PS5) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125)
[11:11:37] <wikibugs>	 (03PS5) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214)
[11:11:38] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@eaabff2] (thin): Regular analytics weekly train THIN [analytics/refinery@eaabff2]
[11:11:42] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@eaabff2] (thin): Regular analytics weekly train THIN [analytics/refinery@eaabff2] (duration: 00m 04s)
[11:12:04] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@eaabff2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eaabff2]
[11:12:39] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42562/console" [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[11:13:47] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@eaabff2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eaabff2] (duration: 01m 43s)
[11:14:27] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[11:16:34] <wikibugs>	 (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211)
[11:26:43] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: acme_chief: ldap-codfw1dev: include private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185)
[11:28:00] <wikibugs>	 (03PS1) 10Ssingh: durum: bind anycast-healthchecker.service to nginx.service [puppet] - 10https://gerrit.wikimedia.org/r/939663
[11:28:59] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42563/console" [puppet] - 10https://gerrit.wikimedia.org/r/939663 (owner: 10Ssingh)
[11:31:28] <wikibugs>	 (03PS2) 10Fabfur: haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211)
[11:31:30] <jinxer-wm>	 (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[11:33:40] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42564/console" [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[11:33:44] <wikibugs>	 (03PS6) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214)
[11:33:46] <wikibugs>	 (03PS1) 10Jbond: puppetboard: create a new site for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939665 (https://phabricator.wikimedia.org/T342125)
[11:34:47] <wikibugs>	 (03Abandoned) 10Btullis: Add the refinery-cache/revs directory to git safe list [puppet] - 10https://gerrit.wikimedia.org/r/922905 (https://phabricator.wikimedia.org/T334493) (owner: 10Stevemunene)
[11:34:53] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42565/console" [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[11:36:11] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] Install MariaDB to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939653 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis)
[11:36:14] <wikibugs>	 (03CR) 10Btullis: Add the refinery-cache/revs directory to git safe list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922905 (https://phabricator.wikimedia.org/T334493) (owner: 10Stevemunene)
[11:38:07] <wikibugs>	 (03PS1) 10Jennifer Ebe: Update refine jobs with new var version [puppet] - 10https://gerrit.wikimedia.org/r/939667
[11:38:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero)
[11:39:45] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (10aborrero) hey @Jclark-ctr if you have more than one cloud-related tasks to do on-site, please give highest priority to...
[11:40:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1016.eqiad.wmnet
[11:45:08] <wikibugs>	 (03PS2) 10Ladsgroup: realm: Add two new private tables of CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076)
[11:45:12] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] realm: Add two new private tables of CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup)
[11:45:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox
[11:45:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42566/console" [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[11:45:53] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "acme-chief/LE can't issue certificates for .wmnet names" [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) (owner: 10Arturo Borrero Gonzalez)
[11:47:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[11:48:03] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42567/console" [puppet] - 10https://gerrit.wikimedia.org/r/939665 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[11:48:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: acme_chief: ldap-codfw1dev: include private FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) (owner: 10Arturo Borrero Gonzalez)
[11:48:19] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetboard: create a new site for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939665 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond)
[11:48:55] <wikibugs>	 (03PS1) 10Ayounsi: Fix some pylint errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939669
[11:49:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Fix some pylint errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939669 (owner: 10Ayounsi)
[11:50:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001"
[11:50:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:50:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy1016.eqiad.wmnet
[11:51:30] <jinxer-wm>	 (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[11:53:29] <icinga-wm>	 PROBLEM - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://puppetboard-next.wikimedia.org:443/ - 574 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:54:04] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update refine jobs with new var version [puppet] - 10https://gerrit.wikimedia.org/r/939667 (owner: 10Jennifer Ebe)
[11:54:40] <wikibugs>	 (03PS1) 10Jbond: puppetboard: should point to production and not use saml [puppet] - 10https://gerrit.wikimedia.org/r/939670
[11:55:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetboard: should point to production and not use saml [puppet] - 10https://gerrit.wikimedia.org/r/939670 (owner: 10Jbond)
[11:56:19] <icinga-wm>	 RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:56] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for cyndywikime - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim)
[12:02:04] <wikibugs>	 (03PS1) 10Jbond: puppetboard: add port [puppet] - 10https://gerrit.wikimedia.org/r/939672
[12:02:18] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:02:40] <icinga-wm>	 PROBLEM - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:03:12] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Install MariaDB to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939653 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis)
[12:04:38] <icinga-wm>	 PROBLEM - MariaDB read only analytics_meta on db1108 is CRITICAL: Could not connect to localhost:3352 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[12:04:38] <icinga-wm>	 PROBLEM - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3351 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[12:04:52] <icinga-wm>	 PROBLEM - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:04:56] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetboard: add port [puppet] - 10https://gerrit.wikimedia.org/r/939672 (owner: 10Jbond)
[12:05:06] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: matomo on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:05:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "root@deploy1002:/home/elukey# istioctl-1.15.7 manifest diff /srv/deployment-charts/custom_deploy.d/istio/ml-serve/config.yaml config.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman)
[12:06:17] <wikibugs>	 (03CR) 10Elukey: "Sorry just seen, could you apply the same change for the gateway services instance?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman)
[12:06:44] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state could not connect Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:06:44] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica Lag: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:06:44] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica Lag: matomo on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:06:44] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:06:44] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica SQL: matomo on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:06:44] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB read only analytics_meta on db1108 is CRITICAL: Could not connect to localhost:3352 Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[12:06:44] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3351 Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[12:06:45] <icinga-wm>	 ACKNOWLEDGEMENT - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Btullis T334055 - migrating to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:07:51] <wikibugs>	 (03PS2) 10Elukey: ml-services/istio: Increase memory quota to 1.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman)
[12:08:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ml-services/istio: Increase memory quota to 1.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman)
[12:10:24] <wikibugs>	 (03PS3) 10Elukey: ml-services/istio: Increase memory quota to 1.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman)
[12:11:42] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[12:11:50] <wikibugs>	 (03PS7) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214)
[12:12:20] <icinga-wm>	 RECOVERY - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:15:14] <icinga-wm>	 RECOVERY - Host lsw1-f2-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[12:15:18] <icinga-wm>	 RECOVERY - Host ps1-f2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms
[12:16:15] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "root@deploy1002:/home/elukey# istioctl-1.15.7 manifest diff /srv/deployment-charts/custom_deploy.d/istio/ml-serve/config.yaml config.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman)
[12:17:03] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard-next.discovery.wmnet on all recursors
[12:17:06] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard-next.discovery.wmnet on all recursors
[12:17:12] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "small fix needed, looking good though" [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[12:17:21] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet on all recursors
[12:17:22] <wikibugs>	 (03PS1) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232)
[12:17:24] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet on all recursors
[12:18:22] <wikibugs>	 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1016.eqiad.wmnet - https://phabricator.wikimedia.org/T342224 (10Ladsgroup) a:05Ladsgroup→03wiki_willy
[12:19:36] <wikibugs>	 (03PS1) 10Jbond: puppetboard: add -next domain to tls certs [puppet] - 10https://gerrit.wikimedia.org/r/939675 (https://phabricator.wikimedia.org/T342214)
[12:20:19] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetboard: add -next domain to tls certs [puppet] - 10https://gerrit.wikimedia.org/r/939675 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[12:22:24] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply
[12:22:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[12:22:33] <jbond>	 !log switch puppetboard.wikimedia.org to use puppet7 infrastructure
[12:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:54] <wikibugs>	 (03PS3) 10Fabfur: haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211)
[12:23:10] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-2] "as stated already by Majavah, Let's Encrypt can't validate SNIs for private domains (non-reachable from the Internet). Please take into ac" [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) (owner: 10Arturo Borrero Gonzalez)
[12:24:48] <wikibugs>	 (03CR) 10Fabfur: haproxy: Add option to disable keepalive on port 80 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[12:25:41] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr powercycled switch link icinga  errors have cleared
[12:26:54] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for cyndywikime - https://phabricator.wikimedia.org/T342230 (10Aklapper) Hi, as a side note, there is currently no connection which could be verified between that staff account on wikitech (which uses a `@wikimedia.org` email address) and the Phabricator accou...
[12:27:22] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/services/ipoid: apply
[12:27:57] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/services/ipoid: apply
[12:29:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:32:47] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: acme_chief: ldap-codfw1dev: include private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) (owner: 10Arturo Borrero Gonzalez)
[12:34:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:35:44] <wikibugs>	 (03PS2) 10Ayounsi: Fix some pylint errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939669
[12:36:20] <wikibugs>	 (03PS1) 10Jbond: puppetdb-api-next: add new discovery record for testing puppetdb-api [dns] - 10https://gerrit.wikimedia.org/r/939678 (https://phabricator.wikimedia.org/T342214)
[12:40:10] <wikibugs>	 (03PS1) 10Jbond: puppetdb-api-next: Add new puppetdb-api discovery record [puppet] - 10https://gerrit.wikimedia.org/r/939679 (https://phabricator.wikimedia.org/T342214)
[12:43:06] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@87be328]: Refactor cassandra loading jobs
[12:43:20] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@87be328]: Refactor cassandra loading jobs (duration: 00m 14s)
[12:46:41] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for cyndywikime - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim) >>! In T342230#9027650, @Aklapper wrote: > Hi, as a side note, there is currently no connection which could be verified between that staff account on wikitech (which uses a `@wiki...
[12:46:52] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim)
[12:48:44] <wikibugs>	 (03PS1) 10Ayounsi: WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681
[12:50:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (owner: 10Ayounsi)
[12:58:20] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "LGTM, should we try this in mwdebug somehow?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[12:58:34] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[12:58:38] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] "Nice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1300).
[13:00:05] <jouncebot>	 subbu and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <subbu>	 o/
[13:00:22] <Lucas_WMDE>	 I can deploy today!
[13:00:41] <TheresNoTime>	 o/
[13:01:21] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry)
[13:01:25] <wikibugs>	 (03PS2) 10Lucas Werkmeister (WMDE): Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry)
[13:01:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry)
[13:01:49] <aanzx>	 Lucas_WMDE: I am adding one patch
[13:01:51] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Papaul) @BTullis yes that is a possibility too to use the 10G nic since those 2 nodes each has 4x1G nic and 2x10G nic. There are 2 way...
[13:01:53] <Lucas_WMDE>	 ok
[13:02:13] * Lucas_WMDE wonders if IS.php should have linting against array keys not starting with wg* or wmg*
[13:02:23] <wikibugs>	 (03Merged) 10jenkins-bot: Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry)
[13:02:52] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:939374|Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) (T318433)]]
[13:02:56] <stashbot>	 T318433: Templates (and extensions) that mimic parser media output need migration to new structure - https://phabricator.wikimedia.org/T318433
[13:04:27] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 ssastry and lucaswerkmeister-wmde: Backport for [[gerrit:939374|Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) (T318433)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:04:42] <Lucas_WMDE>	 subbu: can you test the parsoid styles on mwdebug?
[13:04:47] <subbu>	 yes, testing.
[13:06:47] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Jclark-ctr) a:03Jclark-ctr
[13:07:17] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Jclark-ctr) @BTullis I replaced both SFP-Ts and the link returned
[13:07:56] <subbu>	 it took a bunch of purging and ctrl-r's but it works now. please sync.
[13:08:04] <Lucas_WMDE>	 ok, thanks
[13:09:45] <Lucas_WMDE>	 (syncing)
[13:12:28] <wikibugs>	 (03PS4) 10Fabfur: haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211)
[13:12:47] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Aklapper) @Cyndymediawiksim Thanks for connecting the LDAP account. Please also connect the correct MediaWiki SUL account (created by WMF ITS instead of self-created) if this Phabricator accou...
[13:13:19] <wikibugs>	 (03CR) 10Fabfur: haproxy: Add option to disable keepalive on port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[13:13:38] <Lucas_WMDE>	 made T342249 for catching this mistake in CI, though I’m not sure if Wikimedia-Site-requests is the right phab tag or not
[13:13:39] <stashbot>	 T342249: Prevent incorrect variable name prefix in InitialiseSettings.php - https://phabricator.wikimedia.org/T342249
[13:13:40] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:939374|Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) (T318433)]] (duration: 10m 47s)
[13:13:43] <stashbot>	 T318433: Templates (and extensions) that mimic parser media output need migration to new structure - https://phabricator.wikimedia.org/T318433
[13:14:54] <Lucas_WMDE>	 aanzx: why is the phab task attached to your change already closed?
[13:15:22] <aanzx>	 Lucas_WMDE: only the wordmark was updated
[13:15:30] <aanzx>	 on that task
[13:15:32] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye
[13:17:43] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim) >>! In T342230#9027867, @Aklapper wrote: > @Cyndymediawiksim Thanks for connecting the LDAP account. Please also connect the correct MediaWiki SUL account (created by WMF ITS...
[13:18:09] <Lucas_WMDE>	 I don’t understand yet
[13:18:21] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH)
[13:19:02] <aanzx>	 only the wordmark was done by jdlrobson, not the logo, so i created a new patch for the logo
[13:19:34] <Lucas_WMDE>	 but I don’t see a logo change in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/939682
[13:19:46] <Lucas_WMDE>	 the file that’s replaced is a wordmark, isn’t it?
[13:21:21] <aanzx>	 Lucas_WMDE: ok i didn't check, i will recreate a new patch tomorrow
[13:21:29] <wikibugs>	 (03PS1) 10Jbond: puppetdb: Set X-Client headers [puppet] - 10https://gerrit.wikimedia.org/r/939685 (https://phabricator.wikimedia.org/T342214)
[13:23:02] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[13:23:06] <wikibugs>	 (03Abandoned) 10Anzx: update knwikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939682 (https://phabricator.wikimedia.org/T341912) (owner: 10Anzx)
[13:23:08] <Lucas_WMDE>	 but in any case, if the task isn’t done yet then it sounds like it should be reopened
[13:23:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb: Set X-Client headers [puppet] - 10https://gerrit.wikimedia.org/r/939685 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[13:23:46] <aanzx>	 ok i will reopen it, thanks
[13:23:49] <wikibugs>	 (03PS1) 10JMeybohm: wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314)
[13:23:52] <wikibugs>	 (03PS1) 10JMeybohm: wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314)
[13:23:56] <Lucas_WMDE>	 ok
[13:24:04] <Lucas_WMDE>	 anything else to deploy?
[13:24:22] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: noc: add script to dump etcd db config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto)
[13:24:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[13:24:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[13:24:44] <aanzx>	 Nothing else Lucas_WMDE 
[13:26:11] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:26:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/939663 (owner: 10Ssingh)
[13:27:02] <fabfur>	 !log temporarily disable puppet on cp3052 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/939661 (T342211)
[13:27:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:05] <stashbot>	 T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211
[13:28:02] <wikibugs>	 (03PS1) 10Elukey: Revert "ml-services: bump Docker image for ores-legacy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939339
[13:29:01] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "ml-services: bump Docker image for ores-legacy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939339 (owner: 10Elukey)
[13:29:50] <wikibugs>	 (03PS1) 10Jbond: puppetdb: add allow-header-cert-info: true to auth.conf [puppet] - 10https://gerrit.wikimedia.org/r/939689 (https://phabricator.wikimedia.org/T342214)
[13:29:57] <fabfur>	 !log aborted previous operations, no need to disable puppet to apply that CR (https://gerrit.wikimedia.org/r/c/operations/puppet/+/939661) (T342211)
[13:30:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:08] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur)
[13:31:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[13:31:51] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42572/console" [puppet] - 10https://gerrit.wikimedia.org/r/939689 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[13:32:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: add allow-header-cert-info: true to auth.conf [puppet] - 10https://gerrit.wikimedia.org/r/939689 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[13:32:54] <wikibugs>	 (03PS4) 10Kimberly Sarabia: Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956)
[13:33:51] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: bind anycast-healthchecker.service to nginx.service [puppet] - 10https://gerrit.wikimedia.org/r/939663 (owner: 10Ssingh)
[13:35:28] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10akosiaris) @robh mw hosts are 3 api servers and 3 appservers. You can do them anytime. Also, all it requires is a downtime and a poweroff per the description.
[13:36:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) @Jclark-ctr - many thanks for doing that. I just checked with another run of the cookbook on analytics1073 and it doesn't loo...
[13:37:46] <sukhe>	 some BGP alerts expected because of flapping sessions with the bird restarts
[13:38:05] <sukhe>	 (on durum hosts)
[13:39:12] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10andrea.denisse) 05Open→03Resolved Marking as resolved. :)
[13:39:36] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[13:39:38] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[13:40:10] <wikibugs>	 10SRE, 10Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) That was released @Wed 19 Jul 2023 01:32:50 PM UTC on cp3052.esams.wmnet to test. The results match what we were expecting, so we'll deploy on all text@esams
[13:40:24] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[13:40:29] <wikibugs>	 (03PS1) 10Ssingh: durum: remove redundant whitespace in durum::common [puppet] - 10https://gerrit.wikimedia.org/r/939691
[13:42:05] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[13:42:28] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good, if possible I would add comments on the shellcheck ignores, so our future selves understand why you needed to ignore them" [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond)
[13:42:42] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) > There are 2 ways you will be able to switch to using the 10G nic on those servers. 1- Decommission the server and provision...
[13:42:50] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[13:42:50] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:42:50] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors
[13:42:53] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors
[13:42:56] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[13:43:12] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond)
[13:43:24] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] durum: remove redundant whitespace in durum::common [puppet] - 10https://gerrit.wikimedia.org/r/939691 (owner: 10Ssingh)
[13:43:31] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): tests: Test setting names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939694 (https://phabricator.wikimedia.org/T342249)
[13:44:53] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Papaul) the interface came up and went down  ` papaul@asw2-b-eqiad> show interfaces descriptions ge-7/0/15 Interface       Admin Link D...
[13:45:10] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10bking)
[13:45:46] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[13:45:49] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate flink-cluster-taskmanager to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe)
[13:45:52] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Papaul) right now 1075 is showing up  ` papaul@asw2-c-eqiad> show interfaces descriptions | match analytics1075 ge-7/0/5        up...
[13:46:31] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[13:46:31] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:46:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors
[13:46:34] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors
[13:46:43] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[13:47:11] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "looks great!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939694 (https://phabricator.wikimedia.org/T342249) (owner: 10Lucas Werkmeister (WMDE))
[13:47:17] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe)
[13:48:34] <wikibugs>	 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) @dcausse not sure if you're the right person to ask, if not apologies; but I wanted to know if we're making any write reque...
[13:49:18] <Lucas_WMDE>	 taavi: thanks! should I “deploy” that test change now?
[13:49:26] <Lucas_WMDE>	 (or should it be reviewed by someone else, for example?)
[13:49:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppet JMX mappings - https://phabricator.wikimedia.org/T342253 (10jbond)
[13:49:58] <taavi>	 I'm not aware of any proper 'maintainers' for mw-config who would need to review it too
[13:50:12] <taavi>	 does it need a proper deployment or can you just git pull them to deploy1002?
[13:50:16] * Lucas_WMDE looks at a few other changes that touched tests/
[13:50:21] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] contint: replace Apache 2.2 access control syntax for Jenkins proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn)
[13:50:31] <Lucas_WMDE>	 I would run `scap backport` and hope that it skips the sync
[13:50:33] <wikibugs>	 (03PS1) 10Filippo Giunchedi: icinga_exporter: team-tag netops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/939695
[13:50:37] <Lucas_WMDE>	 like it does for beta changes
[13:51:03] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) >>! In T342141#9028025, @Papaul wrote: > right now 1075 is showing up  > ` > papaul@asw2-c-eqiad> show interfaces description...
[13:51:10] <taavi>	 tests/ is not listed in beta_only_config_files in scap.cfg
[13:51:38] <Lucas_WMDE>	 ah ok
[13:52:00] <Lucas_WMDE>	 I thought maybe it skips anything that doesn’t touch known paths but sounds like the logic is the other way around
[13:52:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppetdb-api-next: Add new puppetdb-api discovery record [puppet] - 10https://gerrit.wikimedia.org/r/939679 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond)
[13:52:09] <Lucas_WMDE>	 just a pull is probably fine
[13:52:38] <Lucas_WMDE>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/889819 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/888105 don’t look like a whole lot of extra review is needed
[13:52:42] <Lucas_WMDE>	 I’ll just +2 and pull then
[13:53:30] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "“deploying” (I’ll just pull it, don’t think it needs to be synced)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939694 (https://phabricator.wikimedia.org/T342249) (owner: 10Lucas Werkmeister (WMDE))
[13:54:01] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "The transfer has now finished, as per https://phabricator.wikimedia.org/T334055#9028088" [puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis)
[13:54:11] <wikibugs>	 (03Merged) 10jenkins-bot: tests: Test setting names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939694 (https://phabricator.wikimedia.org/T342249) (owner: 10Lucas Werkmeister (WMDE))
[13:54:37] <Lucas_WMDE>	 !log pulled [[gerrit:939694|tests: Test setting names (T342249)]] to deploy1002 (no scap sync needed, tests-only change)
[13:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:41] <stashbot>	 T342249: Prevent incorrect variable name prefix in InitialiseSettings.php - https://phabricator.wikimedia.org/T342249
[13:56:06] <wikibugs>	 (CR) Jbond: [C: +2] puppetdb-api-next: add new discovery record for testing puppetdb-api [dns] - https://gerrit.wikimedia.org/r/939678 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[13:56:09] <wikibugs>	 (PS1) Ilias Sarantopoulos: ores-extension: enable lw on eswikiquotes and eswikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/939697 (https://phabricator.wikimedia.org/T342115)
[13:56:30] <wikibugs>	 (CR) Btullis: [V: +1 C: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42573/console" [puppet] - https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: Btullis)
[13:57:04] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Switch references from db1108 to db1208 [puppet] - https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: Btullis)
[13:57:18] <wikibugs>	 SRE, Data Engineering and Event Platform Team, MW-on-K8s, serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (dcausse) >>! In T342252#9028035, @Joe wrote: > @dcausse not sure if you're the right person to ask, if not apologies; but I want...
[13:59:02] <wikibugs>	 SRE, Data Engineering and Event Platform Team, MW-on-K8s, serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (Joe) >>! In T342252#9028119, @dcausse wrote: >>>! In T342252#9028035, @Joe wrote: >> @dcausse not sure if you're the right perso...
[14:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1400)
[14:00:56] <wikibugs>	 (PS1) Jbond: puppetdb::microservice: add -next domain [puppet] - https://gerrit.wikimedia.org/r/939698 (https://phabricator.wikimedia.org/T342214)
[14:01:54] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (jbond)
[14:02:31] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 2 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42574/console" [puppet] - https://gerrit.wikimedia.org/r/939698 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:03:50] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] puppetdb::microservice: add -next domain [puppet] - https://gerrit.wikimedia.org/r/939698 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:05:18] <wikibugs>	 (PS1) Giuseppe Lavagetto: services_proxy: add mw-api-int-async-ro [puppet] - https://gerrit.wikimedia.org/r/939700 (https://phabricator.wikimedia.org/T342252)
[14:06:10] <wikibugs>	 (CR) Ladsgroup: [C: +1] noc: add script to dump etcd db config (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: Giuseppe Lavagetto)
[14:07:32] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:54] <wikibugs>	 (PS1) Giuseppe Lavagetto: mw-api-int: bump replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252)
[14:09:56] <wikibugs>	 (PS1) Giuseppe Lavagetto: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252)
[14:10:55] <wikibugs>	 (CR) CI reject: [V: -1] rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[14:11:30] <wikibugs>	 (PS1) Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts [cookbooks] - https://gerrit.wikimedia.org/r/939704
[14:13:59] <wikibugs>	 SRE, observability: Consider making a variant of the fatalmonitor CLI tool that ignores appserver timeouts - https://phabricator.wikimedia.org/T213777 (lmata) Fatalmonitor no longer actively supported: https://wikitech.wikimedia.org/wiki/Wikimedia_binaries#fatalmonitor Untagging observability, please re-...
[14:14:15] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[14:14:15] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1075.eqiad.wmnet with OS bullseye
[14:14:16] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:14:21] <wikibugs>	 SRE: Consider making a variant of the fatalmonitor CLI tool that ignores appserver timeouts - https://phabricator.wikimedia.org/T213777 (colewhite)
[14:14:36] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, SRE Observability: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (lmata)
[14:15:12] <wikibugs>	 (CR) Ssingh: [C: +1] "Nice work and catch!" [cookbooks] - https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: BCornwall)
[14:16:06] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:16:08] <wikibugs>	 (CR) Ssingh: [C: +1] Remove custom Puppet disable on WDNS reboot (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: BCornwall)
[14:16:10] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[14:17:32] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:19:16] <wikibugs>	 (CR) ArielGlenn: [C: -2] "This change is incomplete; don't review or merge yet, thanks" [puppet] - https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[14:19:49] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1073.eqiad.wmnet with OS bullseye
[14:20:21] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye
[14:21:03] <wikibugs>	 (PS1) Jbond: netbox: make the puppetdb microservic domain configurable [puppet] - https://gerrit.wikimedia.org/r/939706 (https://phabricator.wikimedia.org/T342214)
[14:21:10] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons.
[14:21:36] <wikibugs>	 (PS1) Fabfur: haproxy: disable keepalive on port 80 for cp5024 [puppet] - https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211)
[14:21:56] <wikibugs>	 (PS1) Ssingh: dns5003: temporarily remove from authdns_servers for restart [puppet] - https://gerrit.wikimedia.org/r/939708
[14:22:10] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (herron) Untagging observability to table this wrt the kafka-logging cluster for the time being.  Will need to revisit the kafka-loggin...
[14:22:14] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42575/console" [puppet] - https://gerrit.wikimedia.org/r/939706 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:24:09] <wikibugs>	 (CR) Ssingh: [C: +2] dns5003: temporarily remove from authdns_servers for restart [puppet] - https://gerrit.wikimedia.org/r/939708 (owner: Ssingh)
[14:25:44] <wikibugs>	 (PS1) Jbond: netbox: drop pupetdb_host as its not used [puppet] - https://gerrit.wikimedia.org/r/939709 (https://phabricator.wikimedia.org/T342214)
[14:26:27] <wikibugs>	 (CR) Fabfur: [V: +1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42576/console" [puppet] - https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[14:26:32] <wikibugs>	 (CR) Ayounsi: "I can't review the syntax itself but the logic lgtm with 1 comment." [puppet] - https://gerrit.wikimedia.org/r/939695 (owner: Filippo Giunchedi)
[14:26:56] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42577/console" [puppet] - https://gerrit.wikimedia.org/r/939709 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:27:58] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] netbox: make the puppetdb microservic domain configurable [puppet] - https://gerrit.wikimedia.org/r/939706 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:28:35] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[14:28:36] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[14:29:30] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] netbox: drop pupetdb_host as its not used [puppet] - https://gerrit.wikimedia.org/r/939709 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:29:43] <sukhe>	 DNS/BGP alerts in eqsin expected, restarts of DNS hosts
[14:29:51] <sukhe>	 will be keeping an eye out here but no cause for alarm
[14:30:01] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:30:05] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[14:32:28] <wikibugs>	 (PS1) Jbond: netbox::standalone: switch to using new puppetdb api [puppet] - https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214)
[14:32:51] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns5003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:32:58] <sukhe>	 ^ expected
[14:33:15] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:33:24] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns5003.wikimedia.org
[14:33:33] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:33:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:33:54] <wikibugs>	 (PS2) Jbond: netbox::standalone: switch to using new puppetdb api [puppet] - https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214)
[14:35:00] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42579/console" [puppet] - https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:35:05] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:35:41] <wikibugs>	 (PS1) ArielGlenn: swap in dumpsdata1007 as the new fallback xml dumps nfs share [puppet] - https://gerrit.wikimedia.org/r/939711 (https://phabricator.wikimedia.org/T325232)
[14:36:52] <wikibugs>	 (PS3) Jbond: netbox::standalone: switch to using new puppetdb api [puppet] - https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214)
[14:36:54] <wikibugs>	 (PS1) Jbond: netbox: actully use puppetdb_microservice_fqdn [puppet] - https://gerrit.wikimedia.org/r/939712 (https://phabricator.wikimedia.org/T342214)
[14:37:06] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns5003.wikimedia.org
[14:37:25] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns5003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:38:01] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42580/console" [puppet] - https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:38:06] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42581/console" [puppet] - https://gerrit.wikimedia.org/r/939712 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:38:09] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host mw1413
[14:38:27] <wikibugs>	 (CR) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: Alexandros Kosiaris)
[14:38:33] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] netbox: actully use puppetdb_microservice_fqdn [puppet] - https://gerrit.wikimedia.org/r/939712 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:38:38] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (Jclark-ctr)
[14:39:15] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw1413
[14:39:27] <wikibugs>	 (PS1) Ssingh: Revert "dns5003: temporarily remove from authdns_servers for restart" [puppet] - https://gerrit.wikimedia.org/r/939342
[14:39:42] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host mw1412
[14:40:16] <wikibugs>	 (PS2) Fabfur: haproxy: disable keepalive on port 80 for cp5024 [puppet] - https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211)
[14:40:25] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns5003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:40:49] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:40:49] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw1412
[14:41:07] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:41:58] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "dns5003: temporarily remove from authdns_servers for restart" [puppet] - https://gerrit.wikimedia.org/r/939342 (owner: Ssingh)
[14:42:37] <wikibugs>	 (PS2) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[14:42:39] <wikibugs>	 (PS2) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[14:42:41] <wikibugs>	 (PS2) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[14:43:19] <wikibugs>	 (CR) CI reject: [V: -1] modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: Alexandros Kosiaris)
[14:43:23] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:43:25] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:43:27] <wikibugs>	 (CR) Ayounsi: [C: +1] netbox::standalone: switch to using new puppetdb api [puppet] - https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:43:29] <wikibugs>	 (PS3) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[14:43:31] <wikibugs>	 (PS3) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[14:43:33] <wikibugs>	 (PS3) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[14:43:49] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] netbox::standalone: switch to using new puppetdb api [puppet] - https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) (owner: Jbond)
[14:44:09] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:44:11] <wikibugs>	 (CR) CI reject: [V: -1] modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: Alexandros Kosiaris)
[14:44:19] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:44:42] <wikibugs>	 (PS1) Ayounsi: Add Loopback to INTERFACES_REGEXP [software/netbox-extras] - https://gerrit.wikimedia.org/r/939714
[14:45:35] <wikibugs>	 (CR) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:45:42] <wikibugs>	 (CR) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[14:47:15] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [software/netbox-extras] - https://gerrit.wikimedia.org/r/939714 (owner: Ayounsi)
[14:48:35] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (RobH)
[14:48:37] <wikibugs>	 (PS2) Giuseppe Lavagetto: mw-api-int: bump replicas to 8 [deployment-charts] - https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252)
[14:48:39] <wikibugs>	 (PS2) Giuseppe Lavagetto: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252)
[14:48:43] <wikibugs>	 (PS1) Giuseppe Lavagetto: mw-api-int: increase namespace limits [deployment-charts] - https://gerrit.wikimedia.org/r/939716 (https://phabricator.wikimedia.org/T342252)
[14:49:29] <wikibugs>	 (CR) CI reject: [V: -1] rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: Giuseppe Lavagetto)
[14:49:35] <wikibugs>	 (CR) Jbond: "lgtm but see comment re examples" [cookbooks] - https://gerrit.wikimedia.org/r/939704 (owner: Ssingh)
[14:51:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:51:05] <wikibugs>	 (PS2) Ayounsi: Add Loopback to INTERFACES_REGEXP [software/netbox-extras] - https://gerrit.wikimedia.org/r/939714
[14:51:44] <wikibugs>	 (CR) Ayounsi: [C: +2] Add Loopback to INTERFACES_REGEXP [software/netbox-extras] - https://gerrit.wikimedia.org/r/939714 (owner: Ayounsi)
[14:52:15] <wikibugs>	 (Merged) jenkins-bot: Add Loopback to INTERFACES_REGEXP [software/netbox-extras] - https://gerrit.wikimedia.org/r/939714 (owner: Ayounsi)
[14:53:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudcontrol1005 - jclark@cumin1001"
[14:53:30] <wikibugs>	 (PS2) JMeybohm: wikifunctions: Update orchestrator and evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314)
[14:53:32] <wikibugs>	 (PS2) JMeybohm: wikifunctions: Enable mesh and ingress [deployment-charts] - https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314)
[14:53:34] <wikibugs>	 (PS1) JMeybohm: CI: TestOutcome for diffs requires stdout to not be empty [deployment-charts] - https://gerrit.wikimedia.org/r/939718 (https://phabricator.wikimedia.org/T297314)
[14:53:36] <wikibugs>	 (PS1) Elukey: knative-serving: add options to tune every config-map [deployment-charts] - https://gerrit.wikimedia.org/r/939719
[14:53:38] <wikibugs>	 (PS1) Elukey: admin_ng: set scale-down value for knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/939720
[14:54:09] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudcontrol1005 - jclark@cumin1001"
[14:54:09] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[14:54:17] <wikibugs>	 (PS2) Filippo Giunchedi: icinga_exporter: team-tag netops icinga alerts [puppet] - https://gerrit.wikimedia.org/r/939695
[14:54:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[14:54:29] <wikibugs>	 (CR) CI reject: [V: -1] wikifunctions: Update orchestrator and evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[14:54:41] <wikibugs>	 (CR) CI reject: [V: -1] wikifunctions: Enable mesh and ingress [deployment-charts] - https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: JMeybohm)
[14:55:00] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[14:55:00] <wikibugs>	 (CR) Filippo Giunchedi: icinga_exporter: team-tag netops icinga alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/939695 (owner: Filippo Giunchedi)
[14:55:16] <wikibugs>	 (PS1) Btullis: Fail back hive services to the primary server [dns] - https://gerrit.wikimedia.org/r/939721 (https://phabricator.wikimedia.org/T329716)
[14:55:33] <wikibugs>	 SRE, ops-eqiad, User-aborrero, cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (Jclark-ctr)
[14:56:19] <wikibugs>	 (CR) CI reject: [V: -1] Fail back hive services to the primary server [dns] - https://gerrit.wikimedia.org/r/939721 (https://phabricator.wikimedia.org/T329716) (owner: Btullis)
[14:56:42] <wikibugs>	 (PS1) Ayounsi: Also add SONiC vlan naming to INTERFACES_REGEXP [software/netbox-extras] - https://gerrit.wikimedia.org/r/939722
[14:56:58] <wikibugs>	 (CR) Jbond: [C: +2] ssh: switch to using the same file we use in production [puppet] - https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: Jbond)
[14:57:24] <wikibugs>	 (CR) Ayounsi: [C: +2] Also add SONiC vlan naming to INTERFACES_REGEXP [software/netbox-extras] - https://gerrit.wikimedia.org/r/939722 (owner: Ayounsi)
[14:57:56] <wikibugs>	 (Merged) jenkins-bot: Also add SONiC vlan naming to INTERFACES_REGEXP [software/netbox-extras] - https://gerrit.wikimedia.org/r/939722 (owner: Ayounsi)
[14:58:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[14:58:35] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[14:59:42] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[15:01:29] <wikibugs>	 (CR) Vgutierrez: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[15:01:46] <wikibugs>	 (CR) Fabfur: [C: +2] haproxy: disable keepalive on port 80 for cp5024 [puppet] - https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211) (owner: Fabfur)
[15:03:00] <fabfur>	 !log disabling keepalive on port 80 for cp5024  https://gerrit.wikimedia.org/r/939707 (T342211) 
[15:03:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:05] <stashbot>	 T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211
[15:04:24] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission dbproxy1016.eqiad.wmnet - https://phabricator.wikimedia.org/T342224 (wiki_willy) a:wiki_willy→Jclark-ctr
[15:04:47] <wikibugs>	 (PS4) Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[15:04:49] <wikibugs>	 (PS4) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[15:04:51] <wikibugs>	 (PS4) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[15:05:32] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[15:05:37] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[15:06:22] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add the utils directory; tool to generate reports about deployments [deployment-charts] - https://gerrit.wikimedia.org/r/939723
[15:06:49] <wikibugs>	 (CR) ArielGlenn: [C: +2] swap in dumpsdata1007 as the new fallback xml dumps nfs share [puppet] - https://gerrit.wikimedia.org/r/939711 (https://phabricator.wikimedia.org/T325232) (owner: ArielGlenn)
[15:07:18] <wikibugs>	 SRE, Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (TheDJ) Resolved→Open Hmm. actually.. Seems there is also an exceptional amount of 4xx errors ? Especially today it seems to have exploded.  https://grafana.wikimedia.org/d/000000479/cdn-fronte...
[15:07:24] <wikibugs>	 (PS8) Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945)
[15:07:28] <wikibugs>	 (PS5) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[15:07:30] <wikibugs>	 (PS5) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[15:07:41] <wikibugs>	 (CR) Klausman: [C: +1] knative-serving: add options to tune every config-map [deployment-charts] - https://gerrit.wikimedia.org/r/939719 (owner: Elukey)
[15:07:49] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[15:08:06] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission dbproxy1016.eqiad.wmnet - https://phabricator.wikimedia.org/T342224 (Jclark-ctr)
[15:08:10] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[15:08:13] <wikibugs>	 (CR) CI reject: [V: -1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: Alexandros Kosiaris)
[15:08:16] <wikibugs>	 ops-eqiad, DBA, decommission-hardware: decommission dbproxy1016.eqiad.wmnet - https://phabricator.wikimedia.org/T342224 (Jclark-ctr) Open→Resolved
[15:09:54] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[15:09:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[15:10:14] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (jbond) for puppetdb-api.  i have updated netbox-next and tested the following:  === Reports * [[  https://netbox-next.wikimedia...
[15:11:01] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host mw1411
[15:11:02] <wikibugs>	 (PS6) Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[15:11:04] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw1411
[15:11:04] <wikibugs>	 (PS6) Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[15:11:25] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:11:28] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[15:12:49] <sukhe>	 authdns-update is failing
[15:13:07] <sukhe>	 seems to be some incorrect netbox changes
[15:13:11] <wikibugs>	 (PS3) JMeybohm: wikifunctions: Update orchestrator and evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314)
[15:13:14] <wikibugs>	 (PS3) JMeybohm: wikifunctions: Enable mesh and ingress [deployment-charts] - https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314)
[15:13:16] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] admin_ng: set scale-down value for knative-serving [deployment-charts] - https://gerrit.wikimedia.org/r/939720 (owner: Elukey)
[15:13:21] <sukhe>	 E103|TOO_MANY_NAMES: Found 2 name(s) for IP '10.65.4.188', expected 1: netbox/eqiad.wmnet:491 cloudcontrol1005.eqiad.wmnet. A 10.65.4.188 netbox/eqiad.wmnet:2130 wmf5349.eqiad.wmnet. A 10.65.4.188 
[15:13:52] <sukhe>	 any idea who is working on this?
[15:13:56] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet
[15:13:59] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] knative-serving: add options to tune every config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/939719 (owner: 10Elukey)
[15:15:16] <wikibugs>	 (03PS1) 10Jbond: sre.discovery.datacenter: exclude puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939725 (https://phabricator.wikimedia.org/T342214)
[15:18:01] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[15:18:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[15:18:35] <sukhe>	 XioNoX: ^
[15:18:44] <sukhe>	 I am running to clear up broken authdns-update
[15:18:47] <sukhe>	 see above
[15:18:49] <sukhe>	 11:13:21 < sukhe> E103|TOO_MANY_NAMES: Found 2 name(s) for IP '10.65.4.188', expected 1: netbox/eqiad.wmnet:491 cloudcontrol1005.eqiad.wmnet. A 10.65.4.188 netbox/eqiad.wmnet:2130 wmf5349.eqiad.wmnet. A 10.65.4.188 
[15:18:58] <XioNoX>	 eh
[15:19:00] <logmsgbot>	 !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[15:19:05] <XioNoX>	 cancelled mine
[15:19:06] <wikibugs>	 (03PS1) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214)
[15:19:08] <sukhe>	 thanks
[15:19:16] <sukhe>	 not sure what happened here but trying to run and see
[15:19:17] <XioNoX>	 +1 to merge my lsw1-e8 related changes
[15:19:22] <sukhe>	 ok thanks
[15:19:33] <sukhe>	 hopefully we are in a position to merge :P 
[15:19:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "I dislike that I needed to add the mappings to the chart. I need to revisit this a bit." [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris)
[15:19:54] <sukhe>	 cool, that worked 
[15:20:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: add options to tune every config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/939719 (owner: 10Elukey)
[15:20:08] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet
[15:20:24] <sukhe>	 spoke too soon
[15:20:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "In the followup patch I needed to add the mappings to the values of the chart. This is clearly not sustainable long-term, but I think I ca" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[15:20:31] <XioNoX>	 sukhe: not sure what happened but https://netbox.wikimedia.org/ipam/ip-addresses/2150/changelog/
[15:20:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: set scale-down value for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/939720 (owner: 10Elukey)
[15:21:05] <sukhe>	 yeah, this is broken:
[15:21:05] <sukhe>	 E103|TOO_MANY_NAMES: Found 2 name(s) for IP '10.65.4.188', expected 1: netbox/eqiad.wmnet:491 cloudcontrol1005.eqiad.wmnet. A 10.65.4.188 netbox/eqiad.wmnet:2136 wmf5349.eqiad.wmnet. A 10.65.4.188 
[15:21:19] <sukhe>	 going to ping jclark
[15:21:44] <XioNoX>	 ohhh
[15:21:46] <XioNoX>	 sukhe: I know
[15:21:50] <XioNoX>	 DNS Name 	cloudcontrol1005.eqiad.wmnet
[15:21:54] <XioNoX>	 it should be .mgmt.
[15:22:20] <sukhe>	 aaah right indeed
[15:22:22] <sukhe>	 Asset Tag WMF5349
[15:22:42] <XioNoX>	 sukhe: fixed in netbox
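
Editor's sketch: the E103|TOO_MANY_NAMES failure above comes down to one IP address carrying more than one A record. The snippet below is a minimal illustration of that check, not the actual linter used by authdns-update or sre.dns.netbox. The function name, the (name, ip) tuple layout, and the third record are assumptions made for the example; the first two records are the ones quoted in the error.

```python
from collections import defaultdict


def find_ips_with_multiple_names(records):
    """Return {ip: sorted names} for every IP that maps to more than one name.

    `records` is an iterable of (dns_name, ip) pairs. This layout is an
    assumption for illustration, not the format of the generated zone files.
    """
    names_by_ip = defaultdict(set)
    for name, ip in records:
        names_by_ip[ip].add(name)
    return {
        ip: sorted(names)
        for ip, names in names_by_ip.items()
        if len(names) > 1
    }


records = [
    # The two conflicting A records quoted in the error message.
    ("cloudcontrol1005.eqiad.wmnet.", "10.65.4.188"),
    ("wmf5349.eqiad.wmnet.", "10.65.4.188"),
    # Hypothetical unrelated record (documentation address range).
    ("example-host.example.", "192.0.2.1"),
]

duplicates = find_ips_with_multiple_names(records)
for ip, names in duplicates.items():
    print(f"Found {len(names)} name(s) for IP '{ip}', expected 1: {', '.join(names)}")
```

Only 10.65.4.188 is reported here, because it is the only address with two names.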
[15:23:07] <sukhe>	 <3
[15:23:08] <sukhe>	 trying
[15:23:20] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye
[15:23:29] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:23:31] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.netbox
[15:23:31] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O...
[15:23:55] <wikibugs>	 (03PS1) 10Ahmon Dancy: Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220)
[15:24:20] <sukhe>	 please hold off on running authdns-update
[15:25:23] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: trying to resolve netbox issues - sukhe@cumin2002"
[15:26:32] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: trying to resolve netbox issues - sukhe@cumin2002"
[15:26:32] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:26:35] <sukhe>	 ok, resolved. thanks to XioNoX for fixing the broken record!
[15:27:04] <jbond>	 oh that's good, i was worried the reimage i just started was going to run the dns cookbook before it was fixed :)
[15:27:33] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843)
[15:27:35] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117)
[15:27:37] <wikibugs>	 (03PS7) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117)
[15:28:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:29:57] <wikibugs>	 (03PS2) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/939704
[15:29:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:30:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris)
[15:30:32] <wikibugs>	 (03CR) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh)
[15:30:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:32:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:34:20] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1075.eqiad.wmnet with OS bullseye
[15:34:44] <wikibugs>	 (03PS4) 10JMeybohm: wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314)
[15:35:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:35:33] <apergos>	 !log dumpsdata1007 is now the fallback host for sql/xml dumps and for misc dumps. dumpsdata1004, the former fallback host, is now a spare.
[15:35:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:23] <wikibugs>	 (03PS1) 10Ayounsi: IP validator, make sure mgmt IPs have mgmt in their DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732
[15:37:31] <sukhe>	 :D
[15:37:39] <XioNoX>	 sukhe, jbond: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/939732 
[15:37:53] <sukhe>	 ty! looking
[15:38:12] <wikibugs>	 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1015.eqiad.wmnet - https://phabricator.wikimedia.org/T342103 (10Jclark-ctr)
[15:38:16] <XioNoX>	 I'll test it on netbox-next before rolling to prod, but I need to merge it first
[15:38:19] <wikibugs>	 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1015.eqiad.wmnet - https://phabricator.wikimedia.org/T342103 (10Jclark-ctr) 05Open→03Resolved
[15:38:33] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "Not familiar with this repo but in theory and the idea looks good." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi)
[15:38:37] <wikibugs>	 (03PS1) 10Elukey: knative-serving: removing default logging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/939733
[15:39:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm optimisation inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi)
[15:39:22] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] knative-serving: removing default logging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/939733 (owner: 10Elukey)
[15:40:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:40:24] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1073.eqiad.wmnet with OS bullseye
[15:40:37] <wikibugs>	 (03PS1) 10Ssingh: dns5004: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/939734
[15:42:05] <wikibugs>	 (03PS2) 10Ayounsi: IP validator, make sure mgmt IPs have mgmt in their DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732
[15:42:25] <XioNoX>	 sukhe, jbond, thx, updated with your optimisation
[15:42:32] <wikibugs>	 (03CR) 10Ayounsi: IP validator, make sure mgmt IPs have mgmt in their DNS name (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi)
[15:42:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi)
[15:42:47] <jbond>	 XioNoX: +1
[15:42:53] <wikibugs>	 (03CR) 10Jforrester: "This is a lot simpler, thank you!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[15:42:56] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] IP validator, make sure mgmt IPs have mgmt in their DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi)
[15:43:25] <wikibugs>	 (03Merged) 10jenkins-bot: IP validator, make sure mgmt IPs have mgmt in their DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi)
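
Editor's sketch: the merged change adds an IP validator so that management IPs must have "mgmt" in their DNS name. The snippet below only illustrates that rule and is not the code from software/netbox-extras. The function name, the `is_mgmt` flag (how an address is recognised as a management address is not shown in the log), and the `.mgmt.` hostname in the usage lines are assumptions. Matching on a whole DNS label rather than a substring is also a choice made for this sketch.

```python
from typing import Optional


def check_mgmt_dns_name(dns_name: str, is_mgmt: bool) -> Optional[str]:
    """Return an error message if a management IP lacks a 'mgmt' DNS label.

    Returns None when the name is acceptable or the check does not apply.
    `is_mgmt` is supplied by the caller; deriving it is out of scope here.
    """
    if not is_mgmt or not dns_name:
        # Non-management addresses and empty names are not checked.
        return None
    labels = dns_name.rstrip(".").lower().split(".")
    if "mgmt" not in labels:
        return f"DNS name '{dns_name}' of a management IP must contain a 'mgmt' label"
    return None


# The name that caused the incident: rejected for a management address.
print(check_mgmt_dns_name("cloudcontrol1005.eqiad.wmnet", is_mgmt=True))
# Hypothetical corrected name following the ".mgmt." hint in the chat: accepted.
print(check_mgmt_dns_name("cloudcontrol1005.mgmt.eqiad.wmnet", is_mgmt=True))
```

A check of this kind rejects the bad record at edit time, before it can reach the DNS generation step.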
[15:43:36] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[15:43:48] <sukhe>	 XioNoX: thanks for the patch <3 
[15:44:08] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[15:44:22] <icinga-wm>	 PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100%
[15:44:46] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye
[15:44:57] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu...
[15:45:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: removing default logging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/939733 (owner: 10Elukey)
[15:45:49] <XioNoX>	 sukhe, jbond https://usercontent.irccloud-cdn.com/file/M2RlEm42/Screenshot%202023-07-19%20at%2017-45-23%20Editing%20IP%20address%2010.65.4.188_16%20NetBox.png
[15:46:05] <XioNoX>	 you can try to edit https://netbox-next.wikimedia.org/ipam/ip-addresses/2150/edit/ for example (that's netbox-next)
[15:46:13] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[15:46:16] <XioNoX>	 rolling to prod
[15:46:34] <sukhe>	 nice
[15:48:53] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove custom Puppet disable on WDNS reboot (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall)
[15:49:00] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] Allow disabling puppet on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall)
[15:50:41] <wikibugs>	 (03PS1) 10BCornwall: dns: Update the examples docstring to updated name [cookbooks] - 10https://gerrit.wikimedia.org/r/939736
[15:51:06] <wikibugs>	 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10JMeybohm) https://gerrit.wikimedia.org/r/c/oper...
[15:52:44] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] dns: Update the examples docstring to updated name [cookbooks] - 10https://gerrit.wikimedia.org/r/939736 (owner: 10BCornwall)
[15:53:48] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:54:28] <wikibugs>	 (03PS1) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738
[15:55:34] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:56:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond)
[15:57:51] <wikibugs>	 (03PS2) 10Ahmon Dancy: Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220)
[15:58:58] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 2232 bytes in 8.734 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[15:59:09] <wikibugs>	 (03PS3) 10Ahmon Dancy: Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220)
[15:59:49] <wikibugs>	 (03CR) 10Dduvall: [C: 03+1] Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy)
[16:00:02] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26069 bytes in 1.641 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[16:00:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:01:06] <dancy>	 rzl: Would you be available to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/939730 ?
[16:02:23] <dancy>	 or maybe arnoldokoth ?
[16:04:55] <wikibugs>	 (03PS1) 10Jbond: proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741
[16:04:57] <wikibugs>	 (03PS1) 10Jbond: dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742
[16:05:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:05:09] <wikibugs>	 (03PS2) 10Jbond: dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742
[16:05:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond)
[16:05:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond)
[16:06:12] <arnoldokoth>	 dancy: Yeah, I'm available.
[16:06:19] <dancy>	 sweet!
[16:07:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (owner: 10Jbond)
[16:07:47] <dancy>	 arnoldokoth: eoghan saod hje
[16:08:03] <dancy>	 oops.. eoghan said he'd look at it too.
[16:08:34] <wikibugs>	 (03PS3) 10Jbond: dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742
[16:09:22] <arnoldokoth>	 dancy: no problem.
[16:09:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42585/console" [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond)
[16:11:29] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42586/console" [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy)
[16:14:21] <wikibugs>	 (03PS2) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738
[16:14:23] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42587/console" [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy)
[16:14:31] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "We first need to update the microservice" [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond)
[16:14:54] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:15:27] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10Mpossoupe) Tagging @BelindaMbambo as my manager for approval
[16:16:59] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for e4 mgmt entries - cmooney@cumin1001"
[16:17:23] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42588/console" [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy)
[16:17:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond)
[16:17:29] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10BelindaMbambo) Dear @andrea.denisse I am approving @Mpossoupe for the above request for tunilo , thank you
[16:18:42] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+1] IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm)
[16:19:08] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy)
[16:19:32] <wikibugs>	 (03PS4) 10Ssingh: dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond)
[16:20:22] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for e4 mgmt entries - cmooney@cumin1001"
[16:20:22] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:20:26] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[16:20:35] <wikibugs>	 (03CR) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester)
[16:20:37] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42589/console" [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond)
[16:20:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:21:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:25:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for e4 mgmt entries - cmooney@cumin1001"
[16:26:24] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for e4 mgmt entries - cmooney@cumin1001"
[16:26:24] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:27:34] <icinga-wm>	 PROBLEM - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@matomo.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:52] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/939742/42593/ NOOP on all DNS hosts" [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond)
[16:29:56] <logmsgbot>	 !log joal@deploy1002 Started deploy [airflow-dags/analytics@4c06501]: Fix bug introduced in cassandra loading jobs
[16:30:08] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "Thanks for the patch! I am going to merge it as per your permission." [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond)
[16:30:11] <logmsgbot>	 !log joal@deploy1002 Finished deploy [airflow-dags/analytics@4c06501]: Fix bug introduced in cassandra loading jobs (duration: 00m 15s)
[16:31:04] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: inference chart change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266)
[16:31:46] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: inference chart change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266)
[16:31:49] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] icinga_exporter: team-tag netops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/939695 (owner: 10Filippo Giunchedi)
[16:33:01] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: inference chart change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos)
[16:33:53] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond)
[16:36:46] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10DMburugu) I approve the request.
[16:37:58] <wikibugs>	 (03PS1) 10Elukey: knative-serving: set a more lenient readiness probe for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/939745
[16:39:01] <wikibugs>	 (03CR) 10Klausman: [V: 03+1 C: 03+1] knative-serving: set a more lenient readiness probe for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/939745 (owner: 10Elukey)
[16:40:42] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] knative-serving: set a more lenient readiness probe for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/939745 (owner: 10Elukey)
[16:43:21] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:44:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Aklapper) @Cyndymediawiksim I am sorry that I was unclear. "MediaWiki OAuth1 Account" on https://phabricator.wikimedia.org/settings/panel/external/ should ideally link your MediaWiki work acco...
[16:44:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:46:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:47:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:50:03] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:53:10] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[16:55:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:55:21] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1009.eqiad.wmnet
[16:56:43] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[16:56:46] <logmsgbot>	 !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[16:57:29] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] dns: Update the examples docstring to updated name [cookbooks] - 10https://gerrit.wikimedia.org/r/939736 (owner: 10BCornwall)
[16:57:49] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[16:57:53] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm)
[16:58:24] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove custom Puppet disable on WDNS reboot (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall)
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1700)
[17:00:48] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1009.eqiad.wmnet
[17:02:09] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
[17:09:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and A:wikidough
[17:09:42] <sukhe>	 BGP alerts expected in all sites
[17:12:07] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dns5004: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/939734 (owner: 10Ssingh)
[17:15:34] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns5004.wikimedia.org
[17:15:44] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:15:50] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:16:29] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) We've been investigating this extensively and discussing in some depth on #wikimedia-dcops on IRC.  We've decided to go ahead...
[17:16:30] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1010.eqiad.wmnet
[17:17:42] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ayounsi) FYI this Netbox report is alerting: https://netbox.wikimedia.org/extras/reports/results/4808787/#test_port_block_consistency ` xe-0/0/41 [eqiad] Interface type '10gbase-x-sfpp' does n...
[17:19:26] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns5004.wikimedia.org
[17:20:16] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10cmooney) Thanks @ayounsi   @RobH you can probably connect them to 44 and 45 instead.
[17:20:46] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns5004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[17:22:24] <icinga-wm>	 PROBLEM - Check systemd state on dns5004 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:22:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1010.eqiad.wmnet
[17:23:01] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1011.eqiad.wmnet
[17:23:30] <icinga-wm>	 PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns5004 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[17:25:15] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10RobH)
[17:27:01] <wikibugs>	 (03PS1) 10Ssingh: Revert "dns5004: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/939343
[17:27:40] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:27:46] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:27:56] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "dns5004: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/939343 (owner: 10Ssingh)
[17:27:58] <icinga-wm>	 RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns5004 is OK: OK: UP (pid=4539) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[17:28:14] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns5004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[17:28:24] <icinga-wm>	 RECOVERY - Check systemd state on dns5004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:29:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1011.eqiad.wmnet
[17:32:47] <sukhe>	 !log dummy run of authdns-update
[17:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:42] <wikibugs>	 (03PS1) 10Eevans: deployment-prep: Upgrade restbase04 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/939750 (https://phabricator.wikimedia.org/T313814)
[17:38:08] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:38:22] <sukhe>	 ^ expected
[17:38:50] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] deployment-prep: Upgrade restbase04 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/939750 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[17:38:59] <icinga-wm>	 PROBLEM - Host db1218 #page is DOWN: PING CRITICAL - Packet loss = 100%
[17:39:27] <sukhe>	 er
[17:39:47] <sukhe>	 I am depooling 
[17:39:57] <herron>	 thanks
[17:40:09] <sukhe>	 !log depool db1218
[17:40:10] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:40:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:20] <logmsgbot>	 !log sukhe@cumin2002 dbctl commit (dc=all): 'Depool db1218', diff saved to https://phabricator.wikimedia.org/P49603 and previous config saved to /var/cache/conftool/dbconfig/20230719-174019-sukhe.json
[17:40:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim) >>! In T342230#9028928, @Aklapper wrote: > @Cyndymediawiksim I am sorry that I was unclear. "MediaWiki OAuth1 Account" on https://phabricator.wikimedia.org/settings/panel/ext...
[17:41:02] <sukhe>	 herron: done
[17:41:11] <herron>	 ack, thank you sukhe 
[17:41:12] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bullseye
[17:41:23] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye
[17:41:24] <sukhe>	 not sure who to make aware of this
[17:41:28] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from knams to esams - https://phabricator.wikimedia.org/T342198 (10wiki_willy) Cool, thanks for confirming @Papaul.  Hopefully Iron Mountain will come back with the same confirmation as well.
[17:41:40] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:42:09] <RhinosF1>	 sukhe: Amir1 would be a DBA who's around
[17:42:23] <Amir1>	 sukhe: what's up?
[17:42:30] <sukhe>	 Amir1: 
[17:42:30] <sukhe>	 13:38:58 <+icinga-wm> PROBLEM - Host db1218 #page is DOWN: PING CRITICAL - Packet loss = 100%
[17:42:33] <sukhe>	 depooled it
[17:42:37] <Amir1>	 sigh, thanks
[17:42:51] <RhinosF1>	 That's candidate master for s1 Amir1
[17:43:05] <Amir1>	 yes, I'm aware
[17:43:26] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) >>! In T341992#9026925, @Vgutierrez wrote: > @RobH I'm seeing on cumin1001 logs, that you interrupted the reimage of lvs1013 by pressing Ctrl+C: > ` > 2023-07-18 16:01:28,549 robh 203485...
[17:44:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1218.eqiad.wmnet with reason: Maint
[17:44:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1218.eqiad.wmnet with reason: Maint
[17:45:01] <Amir1>	 I can't ssh into it, gonna do a powercycle
[17:45:03] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) a:03RobH >>! In T341992#9029076, @ayounsi wrote: > FYI this Netbox report is alerting: > https://netbox.wikimedia.org/extras/reports/results/4808787/#test_port_block_consistency > ` >...
[17:45:40] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:45:44] <wikibugs>	 (03PS1) 10Ayounsi: Add includes for lsw1-e8-eqiad v6 PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752
[17:46:02] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) Ok, the Bullseye OS has issues with the drivers for some of the hardware...  Considering these are R430s, I don't think it is worth putting in time to install support for them in Bullsey...
[17:46:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add includes for lsw1-e8-eqiad v6 PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752 (owner: 10Ayounsi)
[17:49:43] <Amir1>	 !log powercycled db1218 (T342284)
[17:49:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:49:47] <stashbot>	 T342284: db1218 crashed - https://phabricator.wikimedia.org/T342284
[17:50:54] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:50:57] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs1013.eqiad.wmnet with OS bullseye
[17:51:12] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ayounsi) @RobH they will need to have their switch port moved.  On QFX5120s, if one port is configured at 1G, the 3 other adjacent ports can only be 1G.  Here port 40 and port 42 are configure...
[17:51:46] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:53:17] <icinga-wm>	 RECOVERY - Host db1218 #page is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[17:53:44] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from esams  to knams - https://phabricator.wikimedia.org/T342198 (10Papaul)
[17:53:50] <wikibugs>	 (03PS2) 10Ayounsi: Add includes for lsw1-e8-eqiad PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752
[17:53:56] <Amir1>	 up now
[17:54:06] <sukhe>	 good ol' power cycle
[17:54:10] <wikibugs>	 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from esams  to knams - https://phabricator.wikimedia.org/T342198 (10Papaul) @wiki_willy i think i made a mistake that i just fixed that confirmation is from esams not from knams. thanks
[17:54:20] <sukhe>	 racadm logs say something?
[17:55:02] <wikibugs>	 (03CR) 10Cathal Mooney: "LGTM!" [dns] - 10https://gerrit.wikimedia.org/r/939752 (owner: 10Ayounsi)
[17:55:18] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] Add includes for lsw1-e8-eqiad PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752 (owner: 10Ayounsi)
[17:55:19] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] Add includes for lsw1-e8-eqiad PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752 (owner: 10Ayounsi)
[17:56:54] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:57:48] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:58:10] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[17:59:46] <XioNoX>	 sukhe: ^ 😬
[18:00:04] <jouncebot>	 dancy and dduvall: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1800).
[18:00:04] <jouncebot>	 dancy and dduvall: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1800).
[18:00:08] <sukhe>	 XioNoX: do you mean the BGP alerts or the DNS one?
[18:00:34] <XioNoX>	 sukhe: the cookbook I forgot 
[18:00:52] <sukhe>	 ah ha
[18:01:52] <wikibugs>	 (03CR) 10BCornwall: "The naming convention would make this file roll-restart-reboot-durum.py. Not a good naming convention for sure, but that's what the other " [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh)
[18:04:17] <wikibugs>	 (03PS3) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/939704
[18:06:46] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:07:07] <wikibugs>	 (03CR) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh)
[18:07:54] <wikibugs>	 (03CR) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh)
[18:08:14] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:11:56] <wikibugs>	 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) >>! In T341992#9029218, @ayounsi wrote: > @RobH they will need to have their switch port moved. >  > On QFX5120s, if one port is configured at 1G, the 3 other adjacent ports can only be...
[18:13:38] <dancy>	 It's train time!
[18:14:31] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939753 (https://phabricator.wikimedia.org/T340246)
[18:14:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939753 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot)
[18:15:52] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939753 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot)
[18:22:08] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:22:28] <wikibugs>	 (03PS2) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232)
[18:22:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn)
[18:23:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:23:36] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:23:46] <wikibugs>	 (03PS1) 10Krinkle: Profiler: Remove "toobig" filter from Arc Lamp ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939755 (https://phabricator.wikimedia.org/T337873)
[18:23:48] <wikibugs>	 (03PS1) 10Krinkle: Profiler: Sync minor changes with arc-lamp.git package [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939756 (https://phabricator.wikimedia.org/T337873)
[18:24:00] <dancy>	 Does anyone know what's up with parse1002?
[18:24:22] <Krinkle>	 dancy: https://sal.toolforge.org/production?p=0&q=parse1002&d=
[18:24:37] <Krinkle>	 looks like it had an issue this morning
[18:24:41] <Krinkle>	 what are you seeing now?
[18:24:57] <dancy>	 ssh connections timing out.
[18:25:03] <dancy>	 (during train deployment)
[18:25:23] <dancy>	 Unresponsive to ping
[18:25:34] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.18  refs T340246
[18:25:38] <stashbot>	 T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246
[18:28:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:29:32] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and A:wikidough
[18:31:06] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+1] "Looks great, and runs as expected!" [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh)
[18:32:40] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+1] sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh)
[18:33:13] <wikibugs>	 (03CR) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh)
[18:36:39] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] sre.dns: add a new cookbook for durum reboot/service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh)
[18:41:18] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and A:durum
[18:43:40] <wikibugs>	 (03PS1) 10Jforrester: Remove wikifunctions.org Varnish 302 [puppet] - 10https://gerrit.wikimedia.org/r/939757 (https://phabricator.wikimedia.org/T275945)
[18:45:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt)
[18:45:51] <wikibugs>	 (03PS3) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232)
[18:46:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) 05Open→03Resolved
[18:49:16] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:49:22] <icinga-wm>	 PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:49:32] <sukhe>	 expected BGP alerts in all sites
[18:49:44] <sukhe>	 durum restart, I am monitoring in case something else comes up
[18:50:48] <icinga-wm>	 RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:50:54] <icinga-wm>	 RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:54:00] <wikibugs>	 (03CR) 10JHathaway: "Would it be worth considering a systemd adhoc timer that would trigger some time after the puppet run is complete? e.g. systemd-run --on-a" [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond)
[18:57:15] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH)
[18:58:00] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH)
[18:58:51] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet
[19:03:30] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet
[19:04:48] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:06:16] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:15:38] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:15:50] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:17:06] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:17:18] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:21:36] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:21:48] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:23:06] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:23:18] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:24:53] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk1003.eqiad.wmnet
[19:26:02] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:27:58] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:28:28] <icinga-wm>	 PROBLEM - BFD status on cr3-esams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:28:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[19:29:28] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:29:58] <icinga-wm>	 RECOVERY - BFD status on cr3-esams is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:32:00] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:32:40] <wikibugs>	 (03PS4) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232)
[19:33:56] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:34:26] <icinga-wm>	 PROBLEM - BFD status on cr3-esams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:34:32] <sukhe>	 wish there was a way to silence these alerts
[19:34:35] <sukhe>	 but alas
[19:34:50] <sukhe>	 in some ways silencing them is not desirable
[19:35:25] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[19:35:26] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:35:56] <icinga-wm>	 RECOVERY - BFD status on cr3-esams is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:36:40] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:36:43] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[19:36:43] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:36:44] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk1003.eqiad.wmnet
[19:36:49] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk1003.eqiad.wmnet` - flink-zk1003.eqiad.wmnet...
[19:37:16] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:37:20] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[19:37:22] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[19:39:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[19:40:29] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[19:40:29] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:40:29] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors
[19:40:32] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors
[19:40:58] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[19:41:42] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:41:42] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[19:42:08] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm
[19:42:16] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[19:42:18] <wikibugs>	 (03PS3) 10Hubaishan: Replace underscores with spaces in 4 Arabic sitenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725)
[19:42:36] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:45:20] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and A:durum
[19:47:40] <sukhe>	 that should be all the bgp alerts 
[19:50:42] <wikibugs>	 (03PS5) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T2000).
[20:00:05] <jouncebot>	 hubaishan and kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:19] <kimberly_sarabia>	 hello
[20:01:02] <TheresNoTime>	 I can deploy!
[20:01:10] <TheresNoTime>	 kimberly_sarabia: I'll do yours first, as it's a beta-only change :)
[20:01:20] <kimberly_sarabia>	 ty
[20:01:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia)
[20:01:50] <wikibugs>	 (03PS1) 10Eevans: cassandra: prevent malformed config when tls_cluster_name is unset [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814)
[20:02:09] <wikibugs>	 (03Merged) 10jenkins-bot: Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia)
[20:03:29] <TheresNoTime>	 kimberly_sarabia: done, it'll be live on beta in a few minutes :)
[20:03:54] <kimberly_sarabia>	 TheresNoTime: Ok! tysm
[20:04:28] <wikibugs>	 (03PS2) 10Eevans: cassandra: prevent malformed config when tls_cluster_name is unset [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814)
[20:04:44] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans)
[20:05:39] <hubaishan>	 Hello
[20:06:16] <wikibugs>	 (03CR) 10Samtar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan)
[20:06:19] <TheresNoTime>	 hubaishan: hi! Just looking at your patch now :) have you done a backport before?
[20:06:39] <hubaishan>	 No
[20:07:34] <TheresNoTime>	 No problem :) first things first, have you read https://wikitech.wikimedia.org/wiki/Backport_windows#Doing_the_deploy and do you have https://wikitech.wikimedia.org/wiki/WikimediaDebug installed?
[20:08:54] <wikibugs>	 (03PS4) 10Samtar: Replace underscores with spaces in 4 Arabic sitenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan)
[20:09:41] <hubaishan>	 https://wikitech.wikimedia.org/wiki/WikimediaDebug is installed
[20:10:08] <TheresNoTime>	 great, let's start :) I'll let you know when to test, and we can look at testing it together
[20:10:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan)
[20:10:55] <wikibugs>	 (03Merged) 10jenkins-bot: Replace underscores with spaces in 4 Arabic sitenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan)
[20:11:24] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:927713|Replace underscores with spaces in 4 Arabic sitenames (T337725)]]
[20:11:27] <stashbot>	 T337725: Replace underscores with spaces in Arabic Wikimedia project sitenames - https://phabricator.wikimedia.org/T337725
[20:12:58] <logmsgbot>	 !log samtar@deploy1002 samtar and hubaishan: Backport for [[gerrit:927713|Replace underscores with spaces in 4 Arabic sitenames (T337725)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:13:57] <TheresNoTime>	 hubaishan: okay, that change is live on mwdebug. You can use the WikimediaDebug extension to pick any of the `mwdebug` servers and test. For example, I can see that https://ar.wikisource.org/wiki/%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85:TheresNoTime/Test normally shows `ويكي_مصدر`, but when using mwdebug, it shows `ويكي مصدر` (after resaving the page)
[20:14:59] <TheresNoTime>	 (that's seeing the output of `{{SITENAME}}` by the way)
[20:15:24] <TheresNoTime>	 Once you're happy that your patch works as expected, let me know and we can sync it :)
[20:16:01] <hubaishan>	 It is Good :]
[20:16:36] <TheresNoTime>	 Awesome, syncing now
[20:17:03] <TheresNoTime>	 Once that's done, which can take a while, I'll let you know and you can test again without using `mwdebug` :)
[20:22:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:22:38] <TheresNoTime>	 Noting that I've had 1 failure during `sync-apaches`, logged at https://phabricator.wikimedia.org/P49604
[20:22:39] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[20:24:30] <TheresNoTime>	 same host during `scap-cdb-rebuild`: `parse1002.eqiad.wmnet`
[20:24:55] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse dns for spine linknets eqiad - cmooney@cumin1001"
[20:25:26] <RhinosF1>	 TheresNoTime: I think parse1002 needs a depool
[20:25:32] <RhinosF1>	 It had issues during the train
[20:26:29] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse dns for spine linknets eqiad - cmooney@cumin1001"
[20:26:29] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:27:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:28:33] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:927713|Replace underscores with spaces in 4 Arabic sitenames (T337725)]] (duration: 17m 09s)
[20:28:36] <stashbot>	 T337725: Replace underscores with spaces in Arabic Wikimedia project sitenames - https://phabricator.wikimedia.org/T337725
[20:28:37] <TheresNoTime>	 hubaishan: okay, can you test again, but this time make sure the `mwdebug` toggle is switched off :)
[20:29:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10RobH)
[20:31:35] <TheresNoTime>	 !log backport window closed
[20:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:21] <hubaishan>	 TheresNoTime it is OK.
[20:33:37] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1003.eqiad.wmnet with OS bookworm
[20:33:37] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[20:33:39] <TheresNoTime>	 great, all done then! :) Thank you for the patch
[20:33:43] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[20:34:14] <wikibugs>	 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10RobH)
[20:34:22] <wikibugs>	 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10RobH)
[20:38:01] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[20:38:02] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[20:39:03] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10cmooney) @Jclark-ctr my apologies for some reason I thought these links had been cabled but seems from T338789 I didn't update the optic type so we need got them...
[20:39:24] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[20:39:26] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[20:39:34] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk1003.eqiad.wmnet
[20:43:29] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[20:54:07] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[20:55:00] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001"
[20:55:00] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:55:01] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk1003.eqiad.wmnet
[20:55:08] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk1003.eqiad.wmnet` - flink-zk1003.eqiad.wmnet...
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T2100)
[21:20:01] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[21:21:49] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 01m 47s)
[21:26:21] <logmsgbot>	 !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124
[21:27:26] <logmsgbot>	 !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 01m 05s)
[21:32:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet
[21:32:56] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.netbox
[21:36:43] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[21:37:29] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[21:37:29] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:37:29] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors
[21:37:32] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors
[21:37:59] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[21:38:43] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001"
[21:41:00] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm
[21:41:08] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm
[21:42:00] <wikibugs>	 10SRE, 10Traffic, 10cloud-services-team: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (10BCornwall) I would think that this needs to be followed since it's technically a new service even it's a rename. For instance, the dns repo still has "labweb" in templates/wmnet.  A...
[21:44:40] <wikibugs>	 10SRE, 10Traffic, 10cloud-services-team: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (10taavi) Yeah, the above patches were just getting rid of the non-TLS endpoint so we have one service to rename instead of two. The actual rename still needs to be done.
[22:08:55] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10taavi)
[22:10:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:10:46] <icinga-wm>	 PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:36:26] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1003.eqiad.wmnet with OS bookworm
[22:36:27] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet
[22:36:33] <wikibugs>	 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w...
[23:57:06] <icinga-wm>	 RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:57:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down