[00:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:31:36] (GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors [00:36:36] (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors [00:38:28] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/939271 [00:38:34] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/939271 (owner: 10TrainBranchBot) [00:53:35] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/939271 (owner: 10TrainBranchBot) [01:14:52] (03CR) 10Gergő Tisza: "Do you want to update the edit summary, now that the patch does a bunch of changes not really related to cswiki? 
And maybe mention the DB " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [01:27:54] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:53:36] (GitLabCIPipelineErrors) firing: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors [01:56:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:58:14] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder) [01:58:36] (GitLabCIPipelineErrors) resolved: GitLab - High pipeline error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIPipelineErrors [02:00:31] 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from knams to esams - https://phabricator.wikimedia.org/T342198 (10Papaul) [02:01:03] 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from knams to esams - https://phabricator.wikimedia.org/T342198 (10Papaul) 
p:05Triage→03Medium a:05wiki_willy→03Papaul [02:01:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder) [02:03:15] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder) [02:06:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:08:16] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder) [02:09:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:13:25] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:14:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:17:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:24:40] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:38] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_tmpdumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:44] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:57:55] (03PS1) 10Sohom Datta: Enable EditInSequence in pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939392 (https://phabricator.wikimedia.org/T341786) [03:23:37] (03CR) 10Tim Starling: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [03:31:06] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [04:17:45] (03PS1) 10David Martin: Create puppet scripting for sqooping Wikifunctions tables [puppet] - 10https://gerrit.wikimedia.org/r/939394 (https://phabricator.wikimedia.org/T342199) [04:34:59] (03PS1) 10Marostegui: Revert "db1198: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/939337 [04:37:17] (03CR) 10Marostegui: [C: 03+2] Revert "db1198: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/939337 (owner: 10Marostegui) [04:37:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49574 and previous config saved to /var/cache/conftool/dbconfig/20230719-043740-root.json [04:37:58] 10SRE, 10ops-eqiad, 10DBA: db1198 crashed - https://phabricator.wikimedia.org/T342129 (10Marostegui) Host being repooled.
[04:52:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49575 and previous config saved to /var/cache/conftool/dbconfig/20230719-045245-root.json [05:07:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49576 and previous config saved to /var/cache/conftool/dbconfig/20230719-050750-root.json [05:22:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49577 and previous config saved to /var/cache/conftool/dbconfig/20230719-052254-root.json [05:27:51] (03PS4) 10Abijeet Patro: Add channel for TtmServerMessageUpdate of Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927701 [05:38:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49578 and previous config saved to /var/cache/conftool/dbconfig/20230719-053759-root.json [05:53:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49579 and previous config saved to /var/cache/conftool/dbconfig/20230719-055304-root.json [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T0600) [06:08:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49580 and previous config saved to /var/cache/conftool/dbconfig/20230719-060809-root.json [06:08:33] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder) [06:14:23] (SystemdUnitFailed) firing: (2) 
load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:17:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:23:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1198 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49581 and previous config saved to /var/cache/conftool/dbconfig/20230719-062313-root.json [07:00:04] Amir1, Urbanecm, and taavi: (Dis)respected human, time to deploy UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T0700). Please do the needful. [07:00:05] dcausse and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:38] o/ [07:02:23] o/ [07:09:34] I suppose I can deploy unless there are objections [07:12:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2158', diff saved to https://phabricator.wikimedia.org/P49582 and previous config saved to /var/cache/conftool/dbconfig/20230719-071204-root.json [07:12:45] (03PS10) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [07:12:47] (03PS4) 10JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - 10https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826) [07:12:55] (03CR) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [07:15:08] (03PS1) 10Marostegui: db2158: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/939622 (https://phabricator.wikimedia.org/T334650) [07:15:10] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42546/console" [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [07:15:42] (03CR) 10Marostegui: [C: 03+2] db2158: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/939622 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [07:15:45] abijeet: deploying your config change [07:17:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dcausse@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927701 (owner: 10Abijeet Patro) [07:17:45] dcausse, ok, thanks! 
[07:17:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49583 and previous config saved to /var/cache/conftool/dbconfig/20230719-071755-root.json [07:18:03] (03Merged) 10jenkins-bot: Add channel for TtmServerMessageUpdate of Translate extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927701 (owner: 10Abijeet Patro) [07:18:49] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:927701|Add channel for TtmServerMessageUpdate of Translate extension]] [07:18:52] abijeet: I suppose this is affecting code in ttm update jobs and thus can't be tested on mw-debug servers? [07:19:10] (03CR) 10JMeybohm: [C: 03+1] confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [07:20:26] !log dcausse@deploy1002 dcausse and abi: Backport for [[gerrit:927701|Add channel for TtmServerMessageUpdate of Translate extension]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:22:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1180', diff saved to https://phabricator.wikimedia.org/P49584 and previous config saved to /var/cache/conftool/dbconfig/20230719-072207-root.json [07:22:35] abijeet: it's live on mw-debug please let me know if you want me to proceed [07:23:00] (03PS1) 10Marostegui: db1180: Migrate to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/939623 (https://phabricator.wikimedia.org/T334650) [07:23:54] (03CR) 10Marostegui: [C: 03+2] db1180: Migrate to mariadb 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/939623 (https://phabricator.wikimedia.org/T334650) (owner: 10Marostegui) [07:24:16] dcausse, I think we can proceed [07:24:22] sure [07:26:33] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 1%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49585 and previous config saved to /var/cache/conftool/dbconfig/20230719-072632-root.json [07:30:12] saw "connect to host parse1002.eqiad.wmnet port 22: Connection timed out" during sync-apaches, is this something we should be worried about? [07:31:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:33:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49586 and previous config saved to /var/cache/conftool/dbconfig/20230719-073300-root.json [07:36:34] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:927701|Add channel for TtmServerMessageUpdate of Translate extension]] (duration: 17m 44s) [07:36:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:39:31] abijeet: deploy done but I got warnings on the server parse1002.eqiad.wmnet, I believe this is unrelated to your change [07:40:18] dcausse, yea, i think so too. [07:41:17] dcausse, like you said the change just enables a log for ttm update jobs [07:41:26] log channel* [07:41:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 3%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49587 and previous config saved to /var/cache/conftool/dbconfig/20230719-074137-root.json [07:44:18] <_joe_> are deployments still ongoing?
[07:44:56] _joe_: scap backport ended [07:45:11] I have two patches to deploy but haven't started them yet [07:45:19] <_joe_> ok then gimmie a sec [07:45:41] !log oblivian@cumin1001 conftool action : set/pooled=inactive; selector: name=parse1002.eqiad.wmnet [07:45:50] <_joe_> dcausse: please proceed [07:45:53] _joe_: thanks! [07:46:38] !log dcausse@deploy1002 Backport cancelled. [07:47:06] <_joe_> !log powercycling parse1002, console blank, unreachable to network [07:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dcausse@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939328 (owner: 10DCausse) [07:48:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49588 and previous config saved to /var/cache/conftool/dbconfig/20230719-074804-root.json [07:49:41] RECOVERY - Host parse1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [07:51:26] jouncebot: next [07:51:26] In 2 hour(s) and 8 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1000) [07:51:52] <_joe_> dcausse: lmk when scap backport finished [07:52:00] sure [07:52:12] <_joe_> I'll repool parse1002 [07:54:15] might take time tho, waiting for CI on an extension [07:54:23] <_joe_> ah ok [07:54:27] <_joe_> then let me repool now [07:54:52] <_joe_> !log ran scap pull, pool on parse1002 after powercycling [07:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:11] PROBLEM - mediawiki-installation DSH group on parse1002 is CRITICAL: Host parse1002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:56:18] <_joe_> shush [07:56:21] <_joe_> it's actually fixed [07:56:30] <_joe_> stupid icinga 
[07:56:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 5%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49589 and previous config saved to /var/cache/conftool/dbconfig/20230719-075642-root.json [08:01:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "While I would like this to be more DRY, it's ok as a first addition." [puppet] - 10https://gerrit.wikimedia.org/r/928159 (https://phabricator.wikimedia.org/T323192) (owner: 10Abijeet Patro) [08:02:31] (03Merged) 10jenkins-bot: Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate [extensions/CirrusSearch] (wmf/1.41.0-wmf.18) - 10https://gerrit.wikimedia.org/r/939328 (owner: 10DCausse) [08:02:59] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:939328|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] [08:03:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49590 and previous config saved to /var/cache/conftool/dbconfig/20230719-080309-root.json [08:04:29] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:939328|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:10:35] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:939328|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] (duration: 07m 36s) [08:11:01] _joe_: all good with the last deploy, thanks for the quick fix! 
[08:11:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49591 and previous config saved to /var/cache/conftool/dbconfig/20230719-081146-root.json [08:11:48] going to extend the deploy window for another patch unless someone has objections [08:12:54] (03PS4) 10Jbond: monitoring: fix bashisms and other minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) [08:13:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dcausse@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939327 (owner: 10DCausse) [08:13:15] (03CR) 10Jbond: "done thanks" [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [08:13:42] 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10TheDJ) This problem was also pretty visible on the wikimediastatus.net graph, I just noticed. 
{F37143438} [08:13:47] 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10TheDJ) a:05TheDJ→03cmooney [08:15:25] (03PS5) 10Jbond: install_server: drop Bashisms [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) [08:16:19] (03PS1) 10Giuseppe Lavagetto: mediawki::maintenance::translationnotifications: fix calendar defintions [puppet] - 10https://gerrit.wikimedia.org/r/939629 [08:16:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawki::maintenance::translationnotifications: fix calendar defintions [puppet] - 10https://gerrit.wikimedia.org/r/939629 (owner: 10Giuseppe Lavagetto) [08:17:34] (03CR) 10Abijeet Patro: [C: 03+1] mediawki::maintenance::translationnotifications: fix calendar defintions [puppet] - 10https://gerrit.wikimedia.org/r/939629 (owner: 10Giuseppe Lavagetto) [08:18:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49592 and previous config saved to /var/cache/conftool/dbconfig/20230719-081814-root.json [08:20:52] (03PS11) 10JMeybohm: kubernetes::master: Publish service-account cert to etcd [puppet] - 10https://gerrit.wikimedia.org/r/937442 (https://phabricator.wikimedia.org/T329826) [08:20:54] (03PS5) 10JMeybohm: kubernetes: Add etcd srv names to clusterconfig structure [puppet] - 10https://gerrit.wikimedia.org/r/937793 (https://phabricator.wikimedia.org/T329826) [08:20:56] (03PS11) 10JMeybohm: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [08:20:58] (03PS1) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [08:22:06] (03PS1) 10Giuseppe Lavagetto: translationnotifications: brown paper bag fix [puppet] - 
10https://gerrit.wikimedia.org/r/939631 [08:22:20] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] translationnotifications: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/939631 (owner: 10Giuseppe Lavagetto) [08:22:33] (03PS2) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [08:24:45] (03CR) 10CI reject: [V: 04-1] kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:26:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49593 and previous config saved to /var/cache/conftool/dbconfig/20230719-082651-root.json [08:27:25] (03CR) 10CI reject: [V: 04-1] kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:28:48] (03Merged) 10jenkins-bot: Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate [extensions/CirrusSearch] (wmf/1.41.0-wmf.17) - 10https://gerrit.wikimedia.org/r/939327 (owner: 10DCausse) [08:29:14] !log dcausse@deploy1002 Started scap: Backport for [[gerrit:939327|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] [08:30:03] (03PS12) 10JMeybohm: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [08:30:05] (03PS3) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [08:30:12] (03CR) 10Filippo Giunchedi: [C: 03+1] service::catalog: add prometheus-https [puppet] - 10https://gerrit.wikimedia.org/r/939326 
(https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [08:30:47] !log dcausse@deploy1002 dcausse: Backport for [[gerrit:939327|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [08:32:07] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42548/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [08:32:52] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) >>! In T342141#9026115, @Papaul wrote: > @BTullis we had the same issue with sessionstore2001 in codw see task below what we... 
[08:33:05] (03PS4) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [08:33:14] 10SRE-swift-storage, 10Observability-Metrics, 10SRE Observability (FY2023/2024-Q1), 10User-fgiunchedi: Create 'titan' role and put new hosts in service - https://phabricator.wikimedia.org/T341999 (10fgiunchedi) p:05Medium→03High Sure why not, {{done}} [08:33:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49594 and previous config saved to /var/cache/conftool/dbconfig/20230719-083319-root.json [08:33:22] (03CR) 10CI reject: [V: 04-1] kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:35:56] (03CR) 10CI reject: [V: 04-1] kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:36:33] (03PS1) 10Elukey: ml-services: update Docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/939633 (https://phabricator.wikimedia.org/T341479) [08:37:01] (03PS13) 10JMeybohm: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [08:37:03] (03PS5) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [08:37:13] !log dcausse@deploy1002 Finished scap: Backport for [[gerrit:939327|Use the LinksUpdate::isRecursive flag again to route cirrusSearchLinksUpdate]] (duration: 07m 59s) [08:38:12] !log closing the UTC morning backport window [08:38:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:30] (03CR) 10Elukey: [C: 03+2] ml-services: update Docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/939633 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [08:39:34] (03PS11) 10Jbond: ssh: switch to using the same file we use in production [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) [08:40:04] (03CR) 10Jbond: ssh: switch to using the same file we use in production (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [08:41:06] (03PS14) 10JMeybohm: confd: allow running multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [08:41:08] (03PS6) 10JMeybohm: kubernetes::master: Add confd config writing all sa certs [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) [08:41:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49595 and previous config saved to /var/cache/conftool/dbconfig/20230719-084156-root.json [08:42:16] 10SRE, 10Traffic, 10Incident Tooling: ncredir redirects for status.wiki* --> status.wikimedia.org - https://phabricator.wikimedia.org/T318804 (10Vgutierrez) >>! In T318804#8639175, @BCornwall wrote: > Looking into it further, it seems this is a very possible change! nginx mappings/site names support wildcard... 
[08:42:52] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42552/console" [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:44:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42553/console" [puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [08:45:19] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [08:45:50] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/939355 (https://phabricator.wikimedia.org/T341334) (owner: 10Jelto) [08:46:07] (03CR) 10JMeybohm: [V: 03+1] kubernetes::master: Add confd config writing all sa certs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939630 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:47:46] (03CR) 10JMeybohm: [V: 03+1] "I've fixed two issues that were uncovered when writing the followup patch. Please double check when you have a minute."
[puppet] - 10https://gerrit.wikimedia.org/r/937573 (https://phabricator.wikimedia.org/T341669) (owner: 10Giuseppe Lavagetto) [08:48:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49596 and previous config saved to /var/cache/conftool/dbconfig/20230719-084823-root.json [08:49:47] (03PS16) 10JMeybohm: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) [08:51:05] (03CR) 10JMeybohm: [C: 03+2] Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [08:51:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [08:51:55] (03Merged) 10jenkins-bot: Update ipoid to certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/937959 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [08:54:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [08:56:39] RECOVERY - mediawiki-installation DSH group on parse1002 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [08:57:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49597 and previous config saved to /var/cache/conftool/dbconfig/20230719-085700-root.json [09:03:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49598 and previous config saved to /var/cache/conftool/dbconfig/20230719-090328-root.json [09:12:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 
100%: Repooling after maintenance', diff saved to https://phabricator.wikimedia.org/P49599 and previous config saved to /var/cache/conftool/dbconfig/20230719-091205-root.json [09:14:08] !log btullis@deploy1002 Started deploy [airflow-dags/analytics_test@be05071]: (no justification provided) [09:14:12] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics_test@be05071]: (no justification provided) (duration: 00m 04s) [09:20:38] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:22:24] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10Vgutierrez) @RobH I'm seeing on cumin1001 logs, that you interrupted the reimage of lvs1013 by pressing Ctrl+C: ` 2023-07-18 16:01:28,549 robh 2034852 [INFO] Completed command '/usr/local/sbin... [09:25:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:32:35] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: Unrelated DNS diffs shown if decommission and makevm cookbooks run at the same time - https://phabricator.wikimedia.org/T342130 (10jbond) >>! In T342130#9024276, @bking wrote: > Was thinking a bit more about this...would it work to do some minimal sanit... 
[09:32:54] (03PS1) 10JMeybohm: deployment_server::global_config: Use symlinks for cluster aliases [puppet] - 10https://gerrit.wikimedia.org/r/939636 (https://phabricator.wikimedia.org/T300033) [09:33:57] (03PS1) 10Ilias Sarantopoulos: ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 [09:35:11] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42554/console" [puppet] - 10https://gerrit.wikimedia.org/r/939636 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:36:18] (03CR) 10Elukey: [C: 03+1] ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 (owner: 10Ilias Sarantopoulos) [09:37:04] (03CR) 10Klausman: [C: 03+1] ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 (owner: 10Ilias Sarantopoulos) [09:38:27] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 (owner: 10Ilias Sarantopoulos) [09:38:39] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetserver: use FQDN in metric [puppet] - 10https://gerrit.wikimedia.org/r/939362 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [09:38:50] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server::global_config: Use symlinks for cluster aliases [puppet] - 10https://gerrit.wikimedia.org/r/939636 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:39:13] (03Merged) 10jenkins-bot: ml-services: remove gpu from nllb model [deployment-charts] - 10https://gerrit.wikimedia.org/r/939637 (owner: 10Ilias Sarantopoulos) [09:39:24] jayme: happy for me to merge you rcr [09:39:32] jbond: yes please [09:39:46] done [09:39:51] thanks [09:43:28] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[09:47:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:48:06] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [09:50:48] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [09:52:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [09:54:15] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [09:54:23] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [09:55:53] (03PS1) 10Elukey: ml-services: bump Docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/939640 (https://phabricator.wikimedia.org/T341479) [09:58:27] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: bump Docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/939640 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [09:58:42] (03CR) 10Elukey: [C: 03+2] ml-services: bump Docker image for ores-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/939640 (https://phabricator.wikimedia.org/T341479) (owner: 10Elukey) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1000) [10:02:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[10:04:12] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): spicerack: update spicerack to work with the newer puppet infrastructure - https://phabricator.wikimedia.org/T341496 (10jbond) [10:06:16] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder) [10:11:07] (03PS1) 10Jbond: puppetserver: do not notify puppetserver service on changes [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) [10:13:27] (03PS2) 10Jbond: puppetserver: do not notify puppetserver service on changes [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) [10:13:34] (03PS1) 10Btullis: Failover hive services to standby server [dns] - 10https://gerrit.wikimedia.org/r/939644 (https://phabricator.wikimedia.org/T329716) [10:13:53] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10phaultfinder) [10:15:45] (03CR) 10Btullis: [C: 03+2] Failover hive services to standby server [dns] - 10https://gerrit.wikimedia.org/r/939644 (https://phabricator.wikimedia.org/T329716) (owner: 10Btullis) [10:16:18] (03PS3) 10Jbond: puppetserver: do not notify puppetserver service on changes [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) [10:17:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [10:18:24] (SystemdUnitFailed) firing: (2) load-categories-daily.service Failed on wdqs2021:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:19:38] (03PS4) 10Jbond: puppetserver: do not notify puppetserver service on changes [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) [10:22:57] (03CR) 10Klausman: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman) [10:23:01] (03PS4) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) [10:23:59] (03PS5) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) [10:26:15] (03PS6) 10Urbanecm: IP Masking: Enable for cswiki beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) [10:26:52] (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [10:27:50] (03CR) 10Urbanecm: IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [10:29:20] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) [10:29:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) 05Open→03In progress p:05Triage→03Medium [10:29:38] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 
(10jbond) [10:30:42] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond) [10:30:55] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): puppetserver monitoring - https://phabricator.wikimedia.org/T342125 (10jbond) 05In progress→03Stalled this is now stalled until we move the old puppetmasteres to the new puppetdb instance [10:43:22] (03PS1) 10Gmodena: data-engineering: flink: alert based on active site [alerts] - 10https://gerrit.wikimedia.org/r/939651 [10:44:10] (03PS1) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) [10:44:33] (03PS3) 10EoghanGaffney: Remove references to releases1002/releases2002 for decom [puppet] - 10https://gerrit.wikimedia.org/r/938889 (https://phabricator.wikimedia.org/T334435) [10:47:02] (03PS1) 10Btullis: Install MariaDB to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939653 (https://phabricator.wikimedia.org/T334055) [10:47:04] (03PS1) 10Btullis: Switch references from db1108 to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) [10:48:25] (03PS2) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) [10:48:27] (03PS1) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) [10:49:21] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42556/console" [puppet] - 10https://gerrit.wikimedia.org/r/939653 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis) [10:49:33] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS 
(CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42557/console" [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [10:50:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42558/console" [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [10:53:09] (03PS1) 10Milimetric: rest-gateway: add route for metrics/knowledge-gap [deployment-charts] - 10https://gerrit.wikimedia.org/r/939656 (https://phabricator.wikimedia.org/T342213) [10:55:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:55:38] (03PS2) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) [10:58:51] (03CR) 10Marostegui: [C: 04-1] "db1208 doesn't have data yet. It first needs to be recloned." 
[puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis) [10:59:07] !log jebe@deploy1002 Started deploy [analytics/refinery@eaabff2]: Regular analytics weekly train [analytics/refinery@eaabff2] [11:00:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [11:03:24] (03PS3) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) [11:03:26] (03PS3) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) [11:04:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42560/console" [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [11:07:01] (03PS4) 10Jbond: puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) [11:07:03] (03PS4) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) [11:08:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42561/console" [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [11:09:32] !log jebe@deploy1002 Finished deploy [analytics/refinery@eaabff2]: Regular analytics weekly train [analytics/refinery@eaabff2] (duration: 10m 24s) [11:11:35] (03PS5) 10Jbond: puppetboard: update main site to support service -next [puppet] - 
10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) [11:11:37] (03PS5) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) [11:11:38] !log jebe@deploy1002 Started deploy [analytics/refinery@eaabff2] (thin): Regular analytics weekly train THIN [analytics/refinery@eaabff2] [11:11:42] !log jebe@deploy1002 Finished deploy [analytics/refinery@eaabff2] (thin): Regular analytics weekly train THIN [analytics/refinery@eaabff2] (duration: 00m 04s) [11:12:04] !log jebe@deploy1002 Started deploy [analytics/refinery@eaabff2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eaabff2] [11:12:39] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42562/console" [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [11:13:47] !log jebe@deploy1002 Finished deploy [analytics/refinery@eaabff2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@eaabff2] (duration: 01m 43s) [11:14:27] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetboard: update main site to support service -next [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [11:16:34] (03PS1) 10Fabfur: haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) [11:26:43] (03PS1) 10Arturo Borrero Gonzalez: acme_chief: ldap-codfw1dev: include private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) [11:28:00] (03PS1) 10Ssingh: durum: bind anycast-healthchecker.service to nginx.service [puppet] - 10https://gerrit.wikimedia.org/r/939663 [11:28:59] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42563/console" [puppet] - 10https://gerrit.wikimedia.org/r/939663 (owner: 10Ssingh) [11:31:28] (03PS2) 10Fabfur: haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) [11:31:30] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:33:40] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42564/console" [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [11:33:44] (03PS6) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) [11:33:46] (03PS1) 10Jbond: puppetboard: create a new site for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939665 (https://phabricator.wikimedia.org/T342125) [11:34:47] (03Abandoned) 10Btullis: Add the refinery-cache/revs directory to git safe list [puppet] - 10https://gerrit.wikimedia.org/r/922905 (https://phabricator.wikimedia.org/T334493) (owner: 10Stevemunene) [11:34:53] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42565/console" [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [11:36:11] (03CR) 10Marostegui: [C: 03+1] Install MariaDB to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939653 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis) [11:36:14] (03CR) 10Btullis: Add the refinery-cache/revs directory to git safe list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/922905 
(https://phabricator.wikimedia.org/T334493) (owner: 10Stevemunene) [11:38:07] (03PS1) 10Jennifer Ebe: Update refine jobs with new var version [puppet] - 10https://gerrit.wikimedia.org/r/939667 [11:38:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudservices1006.eqiad.wmnet - https://phabricator.wikimedia.org/T342161 (10aborrero) [11:39:45] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (10aborrero) hey @Jclark-ctr if you have more than one cloud-related tasks to do on-site, please give highest priority to... [11:40:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.decommission for hosts dbproxy1016.eqiad.wmnet [11:45:08] (03PS2) 10Ladsgroup: realm: Add two new private tables of CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) [11:45:12] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] realm: Add two new private tables of CheckUser [puppet] - 10https://gerrit.wikimedia.org/r/938841 (https://phabricator.wikimedia.org/T341076) (owner: 10Ladsgroup) [11:45:22] !log ladsgroup@cumin1001 START - Cookbook sre.dns.netbox [11:45:44] (03CR) 10Jbond: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42566/console" [puppet] - 10https://gerrit.wikimedia.org/r/939652 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [11:45:53] (03CR) 10Majavah: [C: 04-1] "acme-chief/LE can't issue certificates for .wmnet names" [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) (owner: 10Arturo Borrero Gonzalez) [11:47:34] !log ladsgroup@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1016.eqiad.wmnet decommissioned, removing all IPs except the 
asset tag one - ladsgroup@cumin1001" [11:48:03] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42567/console" [puppet] - 10https://gerrit.wikimedia.org/r/939665 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [11:48:14] (03CR) 10Arturo Borrero Gonzalez: acme_chief: ldap-codfw1dev: include private FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) (owner: 10Arturo Borrero Gonzalez) [11:48:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetboard: create a new site for puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939665 (https://phabricator.wikimedia.org/T342125) (owner: 10Jbond) [11:48:55] (03PS1) 10Ayounsi: Fix some pylint errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939669 [11:49:28] (03CR) 10CI reject: [V: 04-1] Fix some pylint errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939669 (owner: 10Ayounsi) [11:50:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: dbproxy1016.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ladsgroup@cumin1001" [11:50:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:50:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts dbproxy1016.eqiad.wmnet [11:51:30] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [11:53:29] PROBLEM - puppetboard-next.wikimedia.org requires authentication on puppetboard1003 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... 
not found on https://puppetboard-next.wikimedia.org:443/ - 574 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [11:54:04] (03CR) 10Btullis: [C: 03+2] Update refine jobs with new var version [puppet] - 10https://gerrit.wikimedia.org/r/939667 (owner: 10Jennifer Ebe) [11:54:40] (03PS1) 10Jbond: puppetboard: should point to production and not use saml [puppet] - 10https://gerrit.wikimedia.org/r/939670 [11:55:29] (03CR) 10Jbond: [C: 03+2] puppetboard: should point to production and not use saml [puppet] - 10https://gerrit.wikimedia.org/r/939670 (owner: 10Jbond) [11:56:19] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:01:56] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for cyndywikime - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim) [12:02:04] (03PS1) 10Jbond: puppetboard: add port [puppet] - 10https://gerrit.wikimedia.org/r/939672 [12:02:18] PROBLEM - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:02:40] PROBLEM - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:03:12] (03CR) 10Btullis: [V: 03+1 C: 03+2] Install MariaDB to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939653 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis) [12:04:38] PROBLEM - MariaDB read only analytics_meta on db1108 is CRITICAL: Could not connect to localhost:3352 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:04:38] PROBLEM - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3351 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:04:52] PROBLEM - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:04:56] (03CR) 10Jbond: [C: 03+2] puppetboard: add port [puppet] - 10https://gerrit.wikimedia.org/r/939672 (owner: 10Jbond) [12:05:06] PROBLEM - MariaDB Replica SQL: matomo on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:05:38] (03CR) 10Elukey: [C: 03+1] "root@deploy1002:/home/elukey# istioctl-1.15.7 manifest diff /srv/deployment-charts/custom_deploy.d/istio/ml-serve/config.yaml config.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman) [12:06:17] (03CR) 10Elukey: "Sorry just seen, could you apply the same change for the gateway services instance?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman) [12:06:44] ACKNOWLEDGEMENT - MariaDB Replica IO: matomo on db1108 is CRITICAL: CRITICAL slave_io_state could not connect Btullis T334055 - migratnig to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:06:44] ACKNOWLEDGEMENT - MariaDB Replica Lag: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect Btullis T334055 - migratnig to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:06:44] ACKNOWLEDGEMENT - MariaDB Replica Lag: matomo on db1108 is CRITICAL: CRITICAL slave_sql_lag could not connect Btullis T334055 - migratnig to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:06:44] ACKNOWLEDGEMENT - MariaDB Replica SQL: analytics_meta on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect Btullis T334055 - migratnig to db1208 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:06:44] ACKNOWLEDGEMENT - MariaDB Replica SQL: matomo on db1108 is CRITICAL: CRITICAL slave_sql_state could not connect Btullis T334055 - migratnig to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:06:44] ACKNOWLEDGEMENT - MariaDB read only analytics_meta on db1108 is CRITICAL: Could not connect to localhost:3352 Btullis T334055 - migratnig to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:06:44] ACKNOWLEDGEMENT - MariaDB read only matomo on db1108 is CRITICAL: Could not connect to localhost:3351 Btullis T334055 - migratnig to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [12:06:45] ACKNOWLEDGEMENT - mysqld processes on db1108 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Btullis T334055 - migratnig to db1208 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [12:07:51] (03PS2) 10Elukey: ml-services/istio: Increase memory quota to 1.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman) [12:08:26] (03CR) 10CI reject: [V: 04-1] ml-services/istio: Increase memory quota to 1.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman) [12:10:24] (03PS3) 10Elukey: ml-services/istio: Increase memory quota to 1.5Gi [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman) [12:11:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:11:50] (03PS7) 10Jbond: services: swap puppetboard and puppetboard-next [puppet] - 10https://gerrit.wikimedia.org/r/939655 (https://phabricator.wikimedia.org/T342214) [12:12:20] RECOVERY - puppetboard-next.wikimedia.org requires authentication on 
puppetboard1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [12:15:14] RECOVERY - Host lsw1-f2-eqiad.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms [12:15:18] RECOVERY - Host ps1-f2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [12:16:15] (03CR) 10Elukey: [C: 03+2] "root@deploy1002:/home/elukey# istioctl-1.15.7 manifest diff /srv/deployment-charts/custom_deploy.d/istio/ml-serve/config.yaml config.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939647 (owner: 10Klausman) [12:17:03] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard-next.discovery.wmnet on all recursors [12:17:06] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard-next.discovery.wmnet on all recursors [12:17:12] (03CR) 10Vgutierrez: [C: 04-1] "small fix needed, looking good though" [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [12:17:21] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache puppetboard.discovery.wmnet on all recursors [12:17:22] (03PS1) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) [12:17:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) puppetboard.discovery.wmnet on all recursors [12:18:22] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1016.eqiad.wmnet - https://phabricator.wikimedia.org/T342224 (10Ladsgroup) a:05Ladsgroup→03wiki_willy [12:19:36] (03PS1) 10Jbond: puppetboard: add -next domain to tls certs [puppet] - 10https://gerrit.wikimedia.org/r/939675 (https://phabricator.wikimedia.org/T342214) [12:20:19] (03CR) 10Jbond: [C: 03+2] puppetboard: add -next domain to tls certs [puppet] - 10https://gerrit.wikimedia.org/r/939675 
(https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [12:22:24] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:22:28] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:22:33] !log switch puppetboard.wikimedia.org to use puppet7 infrastructure [12:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:54] (03PS3) 10Fabfur: haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) [12:23:10] (03CR) 10Vgutierrez: [C: 04-2] "as stated already by Majavah, Let's Encrypt can't validate SNIs for private domains (non-reachable from the Internet). Please take into ac" [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) (owner: 10Arturo Borrero Gonzalez) [12:24:48] (03CR) 10Fabfur: haproxy: Add option to disable keepalive on port 80 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [12:25:41] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T342197 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr powercycled switch link icinga errors have cleared [12:26:54] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for cyndywikime - https://phabricator.wikimedia.org/T342230 (10Aklapper) Hi, as a side note, there is currently no connection which could be verified between that staff account on wikitech (which uses a `@wikimedia.org` email address) and the Phabricator accou... 
[12:27:22] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/services/ipoid: apply [12:27:57] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/services/ipoid: apply [12:29:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:32:47] (03Abandoned) 10Arturo Borrero Gonzalez: acme_chief: ldap-codfw1dev: include private FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/939662 (https://phabricator.wikimedia.org/T342185) (owner: 10Arturo Borrero Gonzalez) [12:34:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:35:44] (03PS2) 10Ayounsi: Fix some pylint errors [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/939669 [12:36:20] (03PS1) 10Jbond: puppetdb-api-next: add new discovery record for testing puppetdb-api [dns] - 10https://gerrit.wikimedia.org/r/939678 (https://phabricator.wikimedia.org/T342214) [12:40:10] (03PS1) 10Jbond: puppetdb-api-next: Add new puppetdb-api discovery record [puppet] - 10https://gerrit.wikimedia.org/r/939679 (https://phabricator.wikimedia.org/T342214) [12:43:06] !log joal@deploy1002 Started deploy [airflow-dags/analytics@87be328]: Refactor cassandra loading jobs [12:43:20] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@87be328]: Refactor cassandra loading jobs (duration: 00m 14s) [12:46:41] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for cyndywikime - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim) >>! 
In T342230#9027650, @Aklapper wrote: > Hi, as a side note, there is currently no connection which could be verified between that staff account on wikitech (which uses a `@wiki... [12:46:52] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim) [12:48:44] (03PS1) 10Ayounsi: WIP: first scaffolding fo gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 [12:50:31] (03CR) 10CI reject: [V: 04-1] WIP: first scaffolding fo gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (owner: 10Ayounsi) [12:58:20] (03CR) 10Ladsgroup: [C: 03+1] "LGTM, should we try this in mwdebug somehow?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [12:58:34] (03CR) 10Vgutierrez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [12:58:38] (03CR) 10Ladsgroup: [C: 03+1] "Nice" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938645 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1300). [13:00:05] subbu and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] o/ [13:00:22] I can deploy today! 
[13:00:41] o/ [13:01:21] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry) [13:01:25] (03PS2) 10Lucas Werkmeister (WMDE): Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry) [13:01:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry) [13:01:49] Lucas_WMDE: I am adding one patch [13:01:51] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Papaul) @BTullis yes that is a possibility too to use the 10G nic since those 2 nodes each has 4x1G nic and 2x10G nic. There are 2 way... 
[13:01:53] ok [13:02:13] * Lucas_WMDE wonders if IS.php should have linting against array keys not starting with wg* or wmg* [13:02:23] (03Merged) 10jenkins-bot: Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939374 (https://phabricator.wikimedia.org/T318433) (owner: 10Subramanya Sastry) [13:02:52] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:939374|Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) (T318433)]] [13:02:56] T318433: Templates (and extensions) that mimic parser media output need migration to new structure - https://phabricator.wikimedia.org/T318433 [13:04:27] !log lucaswerkmeister-wmde@deploy1002 ssastry and lucaswerkmeister-wmde: Backport for [[gerrit:939374|Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) (T318433)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:04:42] subbu: can you test the parsoid styles on mwdebug? [13:04:47] yes, testing. [13:06:47] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Jclark-ctr) a:03Jclark-ctr [13:07:17] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Jclark-ctr) @BTullis I replaced both sfpt and link returned [13:07:56] it took a bunch of purging and ctrl-r's but it works now. please sync. 
[13:08:04] ok, thanks [13:09:45] (syncing) [13:12:28] (03PS4) 10Fabfur: haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) [13:12:47] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Aklapper) @Cyndymediawiksim Thanks for connecting the LDAP account. Please also connect the correct MediaWiki SUL account (created by WMF ITS instead of self-created) if this Phabricator accou... [13:13:19] (03CR) 10Fabfur: haproxy: Add option to disable keepalive on port 80 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [13:13:38] made T342249 for catching this mistake in CI, though I’m not sure if Wikimedia-Site-requests is the right phab tag or not [13:13:39] T342249: Prevent incorrect variable name prefix in InitialiseSettings.php - https://phabricator.wikimedia.org/T342249 [13:13:40] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:939374|Fix incorrect use of UseLegacyMediaStyles (missing "wg" prefix) (T318433)]] (duration: 10m 47s) [13:13:43] T318433: Templates (and extensions) that mimic parser media output need migration to new structure - https://phabricator.wikimedia.org/T318433 [13:14:54] aanzx: why is the phab task attached to your change already closed? [13:15:22] Lucas_WMDE: only wordmark was updated [13:15:30] on that task [13:15:32] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye [13:17:43] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim) >>! In T342230#9027867, @Aklapper wrote: > @Cyndymediawiksim Thanks for connecting the LDAP account. Please also connect the correct MediaWiki SUL account (created by WMF ITS... 
[13:18:09] I don’t understand yet [13:18:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:rack/setup/install cp1[098-113] - https://phabricator.wikimedia.org/T342159 (10RobH) [13:19:02] only wordmark was done by jdlrobson, not logo so i created new patch for logo [13:19:34] but I don’t see a logo change in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/939682 [13:19:46] the file that’s replaced is a wordmark, isn’t it? [13:21:21] Lucas_WMDE: ok i didn't check, i will recreate new patch tomorrow [13:21:29] (03PS1) 10Jbond: puppetdb: Set X-Client headers [puppet] - 10https://gerrit.wikimedia.org/r/939685 (https://phabricator.wikimedia.org/T342214) [13:23:02] (03CR) 10Vgutierrez: [C: 03+1] haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [13:23:06] (03Abandoned) 10Anzx: update knwikisource logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939682 (https://phabricator.wikimedia.org/T341912) (owner: 10Anzx) [13:23:08] but in any case, if the task isn’t done yet then it sounds like it should be reopened [13:23:26] (03CR) 10Jbond: [C: 03+2] puppetdb: Set X-Client headers [puppet] - 10https://gerrit.wikimedia.org/r/939685 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [13:23:46] ok i will reopen it, thanks [13:23:49] (03PS1) 10JMeybohm: wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) [13:23:52] (03PS1) 10JMeybohm: wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) [13:23:56] ok [13:24:04] anything else to deploy? 
[13:24:22] (03CR) 10Giuseppe Lavagetto: noc: add script to dump etcd db config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [13:24:29] (03CR) 10CI reject: [V: 04-1] wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [13:24:34] (03CR) 10CI reject: [V: 04-1] wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [13:24:44] Nothing else Lucas_WMDE [13:26:11] !log UTC afternoon backport+config window done [13:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:58] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/939663 (owner: 10Ssingh) [13:27:02] !log temporary disable puppet on cp3052 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/939661 (T342211) [13:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:05] T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 [13:28:02] (03PS1) 10Elukey: Revert "ml-services: bump Docker image for ores-legacy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939339 [13:29:01] (03CR) 10Elukey: [C: 03+2] Revert "ml-services: bump Docker image for ores-legacy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/939339 (owner: 10Elukey) [13:29:50] (03PS1) 10Jbond: puppetdb: add allow-header-cert-info: true to auth.conf [puppet] - 10https://gerrit.wikimedia.org/r/939689 (https://phabricator.wikimedia.org/T342214) [13:29:57] !log aborted previous operations, no need to disable puppet to apply that CR (https://gerrit.wikimedia.org/r/c/operations/puppet/+/939661) (T342211) [13:30:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:08] (03CR) 10Fabfur: [C: 03+2] haproxy: Add option to disable keepalive on port 80 [puppet] - 10https://gerrit.wikimedia.org/r/939661 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [13:31:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [13:31:51] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42572/console" [puppet] - 10https://gerrit.wikimedia.org/r/939689 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [13:32:16] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb: add allow-header-cert-info: true to auth.conf [puppet] - 10https://gerrit.wikimedia.org/r/939689 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [13:32:54] (03PS4) 10Kimberly Sarabia: Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) [13:33:51] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: bind anycast-healthchecker.service to nginx.service [puppet] - 10https://gerrit.wikimedia.org/r/939663 (owner: 10Ssingh) [13:35:28] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10akosiaris) @robh mw hosts are 3 api servers and 3 appservers. You can do them anytime. Also it requires is a downtime and a poweroff per the description. [13:36:04] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) @Jclark-ctr - many thanks for doing that. I just checked with another run of the cookbook on analytics1073 and it doesn't loo... 
[13:37:46] some BGP alerts expected because of flapping sessions with the bird restarts [13:38:05] (on durum hosts) [13:39:12] 10SRE, 10SRE-Access-Requests, 10Data Engineering and Event Platform Team, 10Data-Platform-SRE: Add tchin to analytics-admins - https://phabricator.wikimedia.org/T342146 (10andrea.denisse) 05Open→03Resolved Marking as resolved. :) [13:39:36] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [13:39:38] !log bking@cumin1001 START - Cookbook sre.dns.netbox [13:40:10] 10SRE, 10Traffic: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 (10Fabfur) That was released @Wed 19 Jul 2023 01:32:50 PM UTC on cp3052.esams.wmnet to test. The results matches what we were expecting, so we'll deploy on all text@esams [13:40:24] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/938897 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [13:40:29] (03PS1) 10Ssingh: durum: remove redundant whitespace in durum::common [puppet] - 10https://gerrit.wikimedia.org/r/939691 [13:42:05] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [13:42:28] (03CR) 10JHathaway: [C: 03+1] "looks good, if possible I would add comments on the shellcheck ignores, so our future selves understand why you needed to ignore them" [puppet] - 10https://gerrit.wikimedia.org/r/938898 (https://phabricator.wikimedia.org/T95064) (owner: 10Jbond) [13:42:42] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) > There are 2 ways you will be able to switch to using the 10G nic on those servers. 1- Decommission the server and provision... 
[13:42:50] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [13:42:50] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:42:50] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors [13:42:53] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors [13:42:56] !log bking@cumin1001 START - Cookbook sre.dns.netbox [13:43:12] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [13:43:24] (03CR) 10Ssingh: [C: 03+2] durum: remove redundant whitespace in durum::common [puppet] - 10https://gerrit.wikimedia.org/r/939691 (owner: 10Ssingh) [13:43:31] (03PS1) 10Lucas Werkmeister (WMDE): tests: Test setting names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939694 (https://phabricator.wikimedia.org/T342249) [13:44:53] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Papaul) the interface came up an went down ` papaul@asw2-b-eqiad> show interfaces descriptions ge-7/0/15 Interface Admin Link D... 
[13:45:10] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10bking) [13:45:46] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [13:45:49] 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate flink-cluster-taskmanager to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) [13:45:52] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10Papaul) right now 1075 is showing up ` papaul@asw2-c-eqiad> show interfaces descriptions | match analytics1075 ge-7/0/5 up... [13:46:31] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [13:46:31] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:46:31] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors [13:46:34] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors [13:46:43] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [13:47:11] (03CR) 10Majavah: [C: 03+1] "looks great!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/939694 (https://phabricator.wikimedia.org/T342249) (owner: 10Lucas Werkmeister (WMDE)) [13:47:17] 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) [13:48:34] 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) @dcausse not sure if you're the right person to ask, if not apologies; but I wanted to know if we're making any write reque... [13:49:18] taavi: thanks! should I “deploy” that test change now? [13:49:26] (or should it be reviewed by someone else, for example?) [13:49:31] 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): puppet JMX mappings - https://phabricator.wikimedia.org/T342253 (10jbond) [13:49:58] I'm not aware of any proper 'maintainers' for mw-config who would need to review it too [13:50:12] does it need a proper deployment or can you just git pull them to deploy1002? [13:50:16] * Lucas_WMDE looks at a few other changes that touched tests/ [13:50:21] (03CR) 10Hashar: [C: 03+1] contint: replace Apache 2.2 access control syntax for Jenkins proxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [13:50:31] I would run `scap backport` and hope that it skips the sync [13:50:33] (03PS1) 10Filippo Giunchedi: icinga_exporter: team-tag netops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/939695 [13:50:37] like it does for beta changes [13:51:03] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) >>! 
In T342141#9028025, @Papaul wrote: > right now 1075 is showing up > ` > papaul@asw2-c-eqiad> show interfaces description... [13:51:10] tests/ is not listed in beta_only_config_files in scap.cfg [13:51:38] ah ok [13:52:00] I thought maybe it skips anything that doesn’t touch known paths but sounds like the logic is the other way around [13:52:05] (03CR) 10Jbond: [C: 03+2] puppetdb-api-next: Add new puppetdb-api discovery record [puppet] - 10https://gerrit.wikimedia.org/r/939679 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [13:52:09] just a pull is probably fine [13:52:38] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/889819 and https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/888105 don’t look like a whole lot of extra review is needed [13:52:42] I’ll just +2 and pull then [13:53:30] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "“deploying” (I’ll just pull it, don’t think it needs to be synced)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939694 (https://phabricator.wikimedia.org/T342249) (owner: 10Lucas Werkmeister (WMDE)) [13:54:01] (03CR) 10Btullis: [C: 03+1] "The transfer has now finished, as per https://phabricator.wikimedia.org/T334055#9028088" [puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis) [13:54:11] (03Merged) 10jenkins-bot: tests: Test setting names [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939694 (https://phabricator.wikimedia.org/T342249) (owner: 10Lucas Werkmeister (WMDE)) [13:54:37] !log pulled [[gerrit:939694|tests: Test setting names (T342249)]] to deploy1002 (no scap sync needed, tests-only change) [13:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:41] T342249: Prevent incorrect variable name prefix in InitialiseSettings.php - https://phabricator.wikimedia.org/T342249 [13:56:06] (03CR) 10Jbond: [C: 03+2] puppetdb-api-next: add new discovery record for testing puppetdb-api 
[dns] - 10https://gerrit.wikimedia.org/r/939678 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [13:56:09] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable lw on eswikiquotes and eswikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939697 (https://phabricator.wikimedia.org/T342115) [13:56:30] (03CR) 10Btullis: [V: 03+1 C: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42573/console" [puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis) [13:57:04] (03CR) 10Btullis: [V: 03+1 C: 03+2] Switch references from db1108 to db1208 [puppet] - 10https://gerrit.wikimedia.org/r/939654 (https://phabricator.wikimedia.org/T334055) (owner: 10Btullis) [13:57:18] 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10dcausse) >>! In T342252#9028035, @Joe wrote: > @dcausse not sure if you're the right person to ask, if not apologies; but I want... [13:59:02] 10SRE, 10Data Engineering and Event Platform Team, 10MW-on-K8s, 10serviceops: Migrate rdf-streaming-updater to connect to mw-on-k8s - https://phabricator.wikimedia.org/T342252 (10Joe) >>! In T342252#9028119, @dcausse wrote: >>>! In T342252#9028035, @Joe wrote: >> @dcausse not sure if you're the right perso... 
[14:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1400) [14:00:56] (03PS1) 10Jbond: puppetdb::microservice: add -next domain [puppet] - 10https://gerrit.wikimedia.org/r/939698 (https://phabricator.wikimedia.org/T342214) [14:01:54] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) [14:02:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42574/console" [puppet] - 10https://gerrit.wikimedia.org/r/939698 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:03:50] (03CR) 10Jbond: [V: 03+1 C: 03+2] puppetdb::microservice: add -next domain [puppet] - 10https://gerrit.wikimedia.org/r/939698 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:05:18] (03PS1) 10Giuseppe Lavagetto: services_proxy: add mw-api-int-async-ro [puppet] - 10https://gerrit.wikimedia.org/r/939700 (https://phabricator.wikimedia.org/T342252) [14:06:10] (03CR) 10Ladsgroup: [C: 03+1] noc: add script to dump etcd db config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938644 (https://phabricator.wikimedia.org/T341859) (owner: 10Giuseppe Lavagetto) [14:07:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:54] (03PS1) 10Giuseppe Lavagetto: mw-api-int: bump replicas to 8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252) [14:09:56] (03PS1) 10Giuseppe Lavagetto: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - 
10https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) [14:10:55] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: 10Giuseppe Lavagetto) [14:11:30] (03PS1) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 [14:13:59] 10SRE, 10observability: Consider making a variant of the fatalmonitor CLI tool that ignores appserver timeouts - https://phabricator.wikimedia.org/T213777 (10lmata) Fatalmonitor no longer actively supported: https://wikitech.wikimedia.org/wiki/Wikimedia_binaries#fatalmonitor Untagging observability, please re-... [14:14:15] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [14:14:15] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1075.eqiad.wmnet with OS bullseye [14:14:16] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:14:21] 10SRE: Consider making a variant of the fatalmonitor CLI tool that ignores appserver timeouts - https://phabricator.wikimedia.org/T213777 (10colewhite) [14:14:36] 10SRE, 10Data-Engineering, 10Data-Platform-SRE, 10SRE Observability: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10lmata) [14:15:12] (03CR) 10Ssingh: [C: 03+1] "Nice work and catch!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [14:16:06] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:16:08] (03CR) 10Ssingh: [C: 03+1] Remove custom Puppet disable on WDNS reboot (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [14:16:10] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [14:17:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:16] (03CR) 10ArielGlenn: [C: 04-2] "This change is incomplete; don't review or merge yet, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [14:19:49] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1073.eqiad.wmnet with OS bullseye [14:20:21] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host analytics1073.eqiad.wmnet with OS bullseye [14:21:03] (03PS1) 10Jbond: netbox: make the puppetdb microservic domain configurable [puppet] - 10https://gerrit.wikimedia.org/r/939706 (https://phabricator.wikimedia.org/T342214) [14:21:10] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[14:21:36] (03PS1) 10Fabfur: haproxy: disable keepalive on port 80 for cp5024 [puppet] - 10https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211) [14:21:56] (03PS1) 10Ssingh: dns5003: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/939708 [14:22:10] 10SRE, 10Data-Engineering, 10Data-Platform-SRE: Grant IdempotentWrite Kafka Cluster ACL to User:ANONYMOUS in all Kafka clusters - https://phabricator.wikimedia.org/T334733 (10herron) Untagging observability to table this wrt the kafka-logging cluster for the time being. Will need to revisit the kafka-loggin... [14:22:14] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42575/console" [puppet] - 10https://gerrit.wikimedia.org/r/939706 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:24:09] (03CR) 10Ssingh: [C: 03+2] dns5003: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/939708 (owner: 10Ssingh) [14:25:44] (03PS1) 10Jbond: netbox: drop pupetdb_host as its not used [puppet] - 10https://gerrit.wikimedia.org/r/939709 (https://phabricator.wikimedia.org/T342214) [14:26:27] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42576/console" [puppet] - 10https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [14:26:32] (03CR) 10Ayounsi: "I can't review the syntax itself but the logic lgtm with 1 comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/939695 (owner: 10Filippo Giunchedi) [14:26:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42577/console" [puppet] - 10https://gerrit.wikimedia.org/r/939709 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:27:58] (03CR) 10Jbond: [V: 03+1 C: 03+2] netbox: make the puppetdb microservic domain configurable [puppet] - 10https://gerrit.wikimedia.org/r/939706 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:28:35] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [14:28:36] !log bking@cumin1001 START - Cookbook sre.dns.netbox [14:29:30] (03CR) 10Jbond: [V: 03+1 C: 03+2] netbox: drop pupetdb_host as its not used [puppet] - 10https://gerrit.wikimedia.org/r/939709 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:29:43] DNS/BGP alerts in eqsin expected, restarts of DNS hosts [14:29:51] will be keeping an eye out here but no cause for alarm [14:30:01] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:30:05] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [14:32:28] (03PS1) 10Jbond: netbox::standalone: switch to using new puppetdb api [puppet] - 10https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) [14:32:51] PROBLEM - Bird Internet Routing Daemon on dns5003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:32:58] ^ expected [14:33:15] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:33:24] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns5003.wikimedia.org [14:33:33] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:33:49] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:33:54] (03PS2) 10Jbond: netbox::standalone: switch to using new puppetdb api [puppet] - 10https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) [14:35:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42579/console" [puppet] - 10https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:35:05] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:35:41] (03PS1) 10ArielGlenn: swap in dumpsdata1007 as the new fallback xml dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939711 (https://phabricator.wikimedia.org/T325232) [14:36:52] (03PS3) 10Jbond: netbox::standalone: switch to using new puppetdb api [puppet] - 10https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) [14:36:54] (03PS1) 10Jbond: netbox: actully use puppetdb_microservice_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/939712 (https://phabricator.wikimedia.org/T342214) [14:37:06] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns5003.wikimedia.org [14:37:25] PROBLEM - Bird Internet Routing Daemon on dns5003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:38:01] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42580/console" [puppet] - 10https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:38:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42581/console" [puppet] - 10https://gerrit.wikimedia.org/r/939712 
(https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:38:09] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host mw1413 [14:38:27] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:38:33] (03CR) 10Jbond: [V: 03+1 C: 03+2] netbox: actully use puppetdb_microservice_fqdn [puppet] - 10https://gerrit.wikimedia.org/r/939712 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:38:38] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (10Jclark-ctr) [14:39:15] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw1413 [14:39:27] (03PS1) 10Ssingh: Revert "dns5003: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/939342 [14:39:42] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host mw1412 [14:40:16] (03PS2) 10Fabfur: haproxy: disable keepalive on port 80 for cp5024 [puppet] - 10https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211) [14:40:25] RECOVERY - Bird Internet Routing Daemon on dns5003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:40:49] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:40:49] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw1412 [14:41:07] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:41:58] (03CR) 
10Ssingh: [C: 03+2] Revert "dns5003: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/939342 (owner: 10Ssingh) [14:42:37] (03PS2) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:42:39] (03PS2) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [14:42:41] (03PS2) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:43:19] (03CR) 10CI reject: [V: 04-1] modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:43:23] (03CR) 10CI reject: [V: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:43:25] (03CR) 10CI reject: [V: 04-1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:43:27] (03CR) 10Ayounsi: [C: 03+1] netbox::standalone: switch to using new puppetdb api [puppet] - 10https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:43:29] (03PS3) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [14:43:31] (03PS3) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 
(https://phabricator.wikimedia.org/T341117) [14:43:33] (03PS3) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [14:43:49] (03CR) 10Jbond: [V: 03+1 C: 03+2] netbox::standalone: switch to using new puppetdb api [puppet] - 10https://gerrit.wikimedia.org/r/939710 (https://phabricator.wikimedia.org/T342214) (owner: 10Jbond) [14:44:09] (03CR) 10CI reject: [V: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:44:11] (03CR) 10CI reject: [V: 04-1] modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [14:44:19] (03CR) 10CI reject: [V: 04-1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:44:42] (03PS1) 10Ayounsi: Add Loopback to INTERFACES_REGEXP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939714 [14:45:35] (03CR) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:45:42] (03CR) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [14:47:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939714 (owner: 10Ayounsi) [14:48:35] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 
(10RobH) [14:48:37] (03PS2) 10Giuseppe Lavagetto: mw-api-int: bump replicas to 8 [deployment-charts] - 10https://gerrit.wikimedia.org/r/939701 (https://phabricator.wikimedia.org/T342252) [14:48:39] (03PS2) 10Giuseppe Lavagetto: rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) [14:48:43] (03PS1) 10Giuseppe Lavagetto: mw-api-int: increase namespace limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/939716 (https://phabricator.wikimedia.org/T342252) [14:49:29] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: move to mw-api-int, use readonly endpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/939702 (https://phabricator.wikimedia.org/T342252) (owner: 10Giuseppe Lavagetto) [14:49:35] (03CR) 10Jbond: "lgtm but see comment re examples" [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [14:51:03] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:51:05] (03PS2) 10Ayounsi: Add Loopback to INTERFACES_REGEXP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939714 [14:51:44] (03CR) 10Ayounsi: [C: 03+2] Add Loopback to INTERFACES_REGEXP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939714 (owner: 10Ayounsi) [14:52:15] (03Merged) 10jenkins-bot: Add Loopback to INTERFACES_REGEXP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939714 (owner: 10Ayounsi) [14:53:14] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcontrol1005 - jclark@cumin1001" [14:53:30] (03PS2) 10JMeybohm: wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) [14:53:32] (03PS2) 10JMeybohm: wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 
(https://phabricator.wikimedia.org/T297314) [14:53:34] (03PS1) 10JMeybohm: CI: TestOutcome for diffs requires stdout to not be empty [deployment-charts] - 10https://gerrit.wikimedia.org/r/939718 (https://phabricator.wikimedia.org/T297314) [14:53:36] (03PS1) 10Elukey: knative-serving: add options to tune every config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/939719 [14:53:38] (03PS1) 10Elukey: admin_ng: set scale-down value for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/939720 [14:54:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudcontrol1005 - jclark@cumin1001" [14:54:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:54:17] (03PS2) 10Filippo Giunchedi: icinga_exporter: team-tag netops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/939695 [14:54:28] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:54:29] (03CR) 10CI reject: [V: 04-1] wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [14:54:41] (03CR) 10CI reject: [V: 04-1] wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [14:55:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:55:00] (03CR) 10Filippo Giunchedi: icinga_exporter: team-tag netops icinga alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/939695 (owner: 10Filippo Giunchedi) [14:55:16] (03PS1) 10Btullis: Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/939721 (https://phabricator.wikimedia.org/T329716) 
[14:55:33] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2022/2023-Q4): eqiad1: cloudlb: reimage cloudcontrol1005 into new network setup - https://phabricator.wikimedia.org/T341495 (10Jclark-ctr) [14:56:19] (03CR) 10CI reject: [V: 04-1] Fail back hive services to the primary server [dns] - 10https://gerrit.wikimedia.org/r/939721 (https://phabricator.wikimedia.org/T329716) (owner: 10Btullis) [14:56:42] (03PS1) 10Ayounsi: Also add SONiC vlan naming to INTERFACES_REGEXP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939722 [14:56:58] (03CR) 10Jbond: [C: 03+2] ssh: switch to using the same file we use in production [puppet] - 10https://gerrit.wikimedia.org/r/936692 (https://phabricator.wikimedia.org/T340947) (owner: 10Jbond) [14:57:24] (03CR) 10Ayounsi: [C: 03+2] Also add SONiC vlan naming to INTERFACES_REGEXP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939722 (owner: 10Ayounsi) [14:57:56] (03Merged) 10jenkins-bot: Also add SONiC vlan naming to INTERFACES_REGEXP [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939722 (owner: 10Ayounsi) [14:58:15] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:58:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:59:42] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:01:29] (03CR) 10Vgutierrez: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [15:01:46] (03CR) 10Fabfur: [C: 03+2] haproxy: disable keepalive on port 80 for cp5024 [puppet] - 10https://gerrit.wikimedia.org/r/939707 (https://phabricator.wikimedia.org/T342211) (owner: 10Fabfur) [15:03:00] !log disabling keepalive on port 80 for cp5024 https://gerrit.wikimedia.org/r/939707 (T342211) [15:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:05] T342211: Disable keep-alive on HAProxy port 80 - https://phabricator.wikimedia.org/T342211 [15:04:24] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1016.eqiad.wmnet - https://phabricator.wikimedia.org/T342224 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [15:04:47] (03PS4) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [15:04:49] (03PS4) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [15:04:51] (03PS4) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [15:05:32] (03CR) 10CI reject: [V: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [15:05:37] (03CR) 10CI reject: [V: 04-1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [15:06:22] (03PS1) 10Giuseppe Lavagetto: Add the utils directory; tool to generate reports about deployments [deployment-charts] - 
10https://gerrit.wikimedia.org/r/939723 [15:06:49] (03CR) 10ArielGlenn: [C: 03+2] swap in dumpsdata1007 as the new fallback xml dumps nfs share [puppet] - 10https://gerrit.wikimedia.org/r/939711 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [15:07:18] 10SRE, 10Traffic: increased 5xx rate for esams frontend traffic - https://phabricator.wikimedia.org/T342121 (10TheDJ) 05Resolved→03Open Hmm. actually.. Seems there is also an exceptional amount of 4xx errors ? Especially today it seems to have exploded. https://grafana.wikimedia.org/d/000000479/cdn-fronte... [15:07:24] (03PS8) 10Jforrester: [DNM] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [15:07:28] (03PS5) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [15:07:30] (03PS5) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [15:07:41] (03CR) 10Klausman: [C: 03+1] knative-serving: add options to tune every config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/939719 (owner: 10Elukey) [15:07:49] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [15:08:06] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1016.eqiad.wmnet - https://phabricator.wikimedia.org/T342224 (10Jclark-ctr) [15:08:10] (03CR) 10CI reject: [V: 04-1] cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [15:08:13] (03CR) 10CI reject: [V: 04-1] cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 
10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [15:08:16] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1016.eqiad.wmnet - https://phabricator.wikimedia.org/T342224 (10Jclark-ctr) 05Open→03Resolved [15:09:54] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [15:09:55] !log bking@cumin1001 START - Cookbook sre.dns.netbox [15:10:14] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10jbond) for puppetdb-api. i have updated netbox-next and tested the following: === Reports * [[ https://netbox-next.wikimedia... [15:11:01] !log robh@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host mw1411 [15:11:02] (03PS6) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [15:11:04] !log robh@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host mw1411 [15:11:04] (03PS6) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [15:11:25] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:11:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [15:12:49] authdns-update is failing [15:13:07] seems to be some incorrect netbox changes [15:13:11] (03PS3) 10JMeybohm: wikifunctions: Update orchestrator and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) [15:13:14] (03PS3) 10JMeybohm: wikifunctions: Enable mesh and ingress [deployment-charts] - 
10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) [15:13:16] (03CR) 10Ilias Sarantopoulos: [C: 03+1] admin_ng: set scale-down value for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/939720 (owner: 10Elukey) [15:13:21] E103|TOO_MANY_NAMES: Found 2 name(s) for IP '10.65.4.188', expected 1: netbox/eqiad.wmnet:491 cloudcontrol1005.eqiad.wmnet. A 10.65.4.188 netbox/eqiad.wmnet:2130 wmf5349.eqiad.wmnet. A 10.65.4.188 [15:13:52] any idea who is working on this? [15:13:56] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet [15:13:59] (03CR) 10Ilias Sarantopoulos: [C: 03+1] knative-serving: add options to tune every config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/939719 (owner: 10Elukey) [15:15:16] (03PS1) 10Jbond: sre.discovery.datacenter: exclude puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939725 (https://phabricator.wikimedia.org/T342214) [15:18:01] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [15:18:28] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [15:18:35] XioNoX: ^ [15:18:44] I am running to clear up broken authdns-update [15:18:47] see above [15:18:49] 11:13:21 < sukhe> E103|TOO_MANY_NAMES: Found 2 name(s) for IP '10.65.4.188', expected 1: netbox/eqiad.wmnet:491 cloudcontrol1005.eqiad.wmnet. A 10.65.4.188 netbox/eqiad.wmnet:2130 wmf5349.eqiad.wmnet. 
A 10.65.4.188 [15:18:58] eh [15:19:00] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [15:19:05] cancelled mine [15:19:06] (03PS1) 10Jbond: DO NOT MERGE: Change to test new puppetdb-api-next [cookbooks] - 10https://gerrit.wikimedia.org/r/939726 (https://phabricator.wikimedia.org/T342214) [15:19:08] thanks [15:19:16] not sure what happened here but trying to run and see [15:19:17] +1 to merge my lsw1-e8 related changes [15:19:22] ok thanks [15:19:33] hopefully we are in a position to merge :P [15:19:34] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I dislike that I needed to add the mappings to the chart. I need to revisit this a bit." [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) (owner: 10Alexandros Kosiaris) [15:19:54] cool, that worked [15:20:07] (03CR) 10Elukey: [C: 03+2] knative-serving: add options to tune every config-map [deployment-charts] - 10https://gerrit.wikimedia.org/r/939719 (owner: 10Elukey) [15:20:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet [15:20:24] spoke too soon [15:20:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] "In the followup patch I needed to add the mappings to the values of the chart. This is clearly not sustainable long-term, but I think I ca" [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [15:20:31] sukhe: not sure what happened but https://netbox.wikimedia.org/ipam/ip-addresses/2150/changelog/ [15:20:42] (03CR) 10Elukey: [C: 03+2] admin_ng: set scale-down value for knative-serving [deployment-charts] - 10https://gerrit.wikimedia.org/r/939720 (owner: 10Elukey) [15:21:05] yeah, this is broken: [15:21:05] E103|TOO_MANY_NAMES: Found 2 name(s) for IP '10.65.4.188', expected 1: netbox/eqiad.wmnet:491 cloudcontrol1005.eqiad.wmnet. A 10.65.4.188 netbox/eqiad.wmnet:2136 wmf5349.eqiad.wmnet. 
A 10.65.4.188 [15:21:19] going to ping jclark [15:21:44] ohhh [15:21:46] sukhe: I know [15:21:50] DNS Name cloudcontrol1005.eqiad.wmnet [15:21:54] it should be .mgmt. [15:22:20] aaah right indeed [15:22:22] Asset Tag WMF5349 [15:22:42] sukhe: fixed in netbox [15:23:07] <3 [15:23:08] trying [15:23:20] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet with OS bullseye [15:23:29] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:23:31] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [15:23:31] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with O... [15:23:55] (03PS1) 10Ahmon Dancy: Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) [15:24:20] please hold off on running authdns-update [15:25:23] !log sukhe@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: trying to resolve netbox issues - sukhe@cumin2002" [15:26:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: trying to resolve netbox issues - sukhe@cumin2002" [15:26:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:26:35] ok, resolved. thanks to XioNoX for fixing the broken record! 
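For readers following the incident above: the E103|TOO_MANY_NAMES error was raised because two A records pointed at the same IP (10.65.4.188). Below is a minimal sketch of that kind of check. It is not the actual DNS linter or the Netbox validator merged later in this log; the function name `find_too_many_names` and the input shape (a list of `(name, ip)` pairs) are assumptions made for illustration, and the sample records are copied from the error message quoted in the channel.

```python
from collections import defaultdict


def find_too_many_names(records):
    """Return {ip: sorted list of names} for every IP that has more than one name.

    `records` is an iterable of (name, ip) pairs representing A records.
    Hypothetical helper for illustration only, not the real linter.
    """
    names_by_ip = defaultdict(set)
    for name, ip in records:
        # Normalise: drop the trailing dot of a fully qualified name, ignore case.
        names_by_ip[ip].add(name.rstrip(".").lower())
    return {
        ip: sorted(names)
        for ip, names in names_by_ip.items()
        if len(names) > 1
    }


# Sample data taken from the error message quoted in the log.
records = [
    ("cloudcontrol1005.eqiad.wmnet.", "10.65.4.188"),
    ("wmf5349.eqiad.wmnet.", "10.65.4.188"),
]

conflicts = find_too_many_names(records)
for ip, names in conflicts.items():
    print(f"Found {len(names)} name(s) for IP '{ip}', expected 1: {', '.join(names)}")
```

Grouping names by IP in a set means a record that is merely repeated is not flagged; only distinct names sharing one address are reported, which matches the situation described in the channel.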
[15:27:04] oh thats good i was worried the reimage i just started was going to run the dns cookbook before it was fixed :) [15:27:33] (03PS5) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) [15:27:35] (03PS7) 10Alexandros Kosiaris: cxserver: Bump to networkpolicy_1.1.0.tpl [deployment-charts] - 10https://gerrit.wikimedia.org/r/935748 (https://phabricator.wikimedia.org/T341117) [15:27:37] (03PS7) 10Alexandros Kosiaris: cxserver: Migrate to the new MariaDB egress functionality [deployment-charts] - 10https://gerrit.wikimedia.org/r/935749 (https://phabricator.wikimedia.org/T341117) [15:28:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:29:57] (03PS2) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 [15:29:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:30:27] (03CR) 10Alexandros Kosiaris: modules: Add a new networkpolicy for base modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/935746 (https://phabricator.wikimedia.org/T340843) (owner: 10Alexandros Kosiaris) [15:30:32] (03CR) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [15:30:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:32:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:34:20] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1075.eqiad.wmnet with OS bullseye [15:34:44] (03PS4) 10JMeybohm: wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) [15:35:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:35:33] !log dumpsdata1007 is now the fallback host for sql/xml dumps and for misc dumps. dumpsdata1004, the former fallback host, is now a spare. [15:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:23] (03PS1) 10Ayounsi: IP validator, make sure mgmt IPs have mgmt in their DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 [15:37:31] :D [15:37:39] sukhe, jbond: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/939732 [15:37:53] ty! looking [15:38:12] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1015.eqiad.wmnet - https://phabricator.wikimedia.org/T342103 (10Jclark-ctr) [15:38:16] I'll test it on netbox-next before rolling to prod, but I need to merge it first [15:38:19] 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1015.eqiad.wmnet - https://phabricator.wikimedia.org/T342103 (10Jclark-ctr) 05Open→03Resolved [15:38:33] (03CR) 10Ssingh: [C: 03+1] "Not familiar with this repo but in theory and the idea looks good." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi) [15:38:37] (03PS1) 10Elukey: knative-serving: removing default logging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/939733 [15:39:18] (03CR) 10Jbond: [C: 03+1] "lgtm optimisation inline" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi) [15:39:22] (03CR) 10Klausman: [C: 03+1] knative-serving: removing default logging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/939733 (owner: 10Elukey) [15:40:03] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:40:24] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host analytics1073.eqiad.wmnet with OS bullseye [15:40:37] (03PS1) 10Ssingh: dns5004: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/939734 [15:42:05] (03PS2) 10Ayounsi: IP validator, make sure mgmt IPs have mgmt in their DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 [15:42:25] sukhe, jbond, thx, updated with your optimisation [15:42:32] (03CR) 10Ayounsi: IP validator, make sure mgmt IPs have mgmt in their DNS name (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi) [15:42:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi) [15:42:47] XioNoX: +1 [15:42:53] (03CR) 10Jforrester: "This is a lot simpler, thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [15:42:56] (03CR) 10Ayounsi: [C: 03+2] IP validator, make sure mgmt IPs have mgmt in their DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi) [15:43:25] (03Merged) 10jenkins-bot: IP validator, make sure mgmt IPs have mgmt in their DNS name [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/939732 (owner: 10Ayounsi) [15:43:36] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [15:43:48] XioNoX: thanks for the patch <3 [15:44:08] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [15:44:22] PROBLEM - Host parse1002 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:46] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1002.eqiad.wmnet with OS bullseye [15:44:57] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 3 others: update systems to use new puppetdb instance - https://phabricator.wikimedia.org/T342214 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host sretest1002.eqiad.wmnet with OS bu... 
[15:45:34] (03CR) 10Elukey: [C: 03+2] knative-serving: removing default logging config [deployment-charts] - 10https://gerrit.wikimedia.org/r/939733 (owner: 10Elukey) [15:45:49] sukhe, jbond https://usercontent.irccloud-cdn.com/file/M2RlEm42/Screenshot%202023-07-19%20at%2017-45-23%20Editing%20IP%20address%2010.65.4.188_16%20NetBox.png [15:46:05] you can try to edit https://netbox-next.wikimedia.org/ipam/ip-addresses/2150/edit/ for example (that's netbox-next) [15:46:13] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [15:46:16] rolling to prod [15:46:34] nice [15:48:53] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove custom Puppet disable on WDNS reboot (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [15:49:00] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Allow disabling puppet on reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/939377 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [15:50:41] (03PS1) 10BCornwall: dns: Update the examples docstring to updated name [cookbooks] - 10https://gerrit.wikimedia.org/r/939736 [15:51:06] 10SRE, 10serviceops, 10Abstract Wikipedia team (Phase λ – Launch), 10Patch-For-Review, 10Service-deployment-requests: New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch) - https://phabricator.wikimedia.org/T297314 (10JMeybohm) https://gerrit.wikimedia.org/r/c/oper... [15:52:44] (03CR) 10Ssingh: [C: 03+1] dns: Update the examples docstring to updated name [cookbooks] - 10https://gerrit.wikimedia.org/r/939736 (owner: 10BCornwall) [15:53:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:54:28] (03PS1) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [15:55:34] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:56:49] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond) [15:57:51] (03PS2) 10Ahmon Dancy: Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) [15:58:58] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 2232 bytes in 8.734 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [15:59:09] (03PS3) 10Ahmon Dancy: Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) [15:59:49] (03CR) 10Dduvall: [C: 03+1] Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy) [16:00:02] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26069 bytes in 1.641 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [16:00:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:06] rzl: Would you be available to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/939730 ? [16:02:23] or maybe arnoldokoth ? 
[16:04:55] (03PS1) 10Jbond: proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 [16:04:57] (03PS1) 10Jbond: dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 [16:05:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:05:09] (03PS2) 10Jbond: dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 [16:05:23] (03CR) 10CI reject: [V: 04-1] dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond) [16:05:34] (03CR) 10CI reject: [V: 04-1] dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond) [16:06:12] dancy: Yeah, I'm available. [16:06:19] sweet! [16:07:32] (03CR) 10CI reject: [V: 04-1] proifile::puppetdb::microservice: add allowed_roles [puppet] - 10https://gerrit.wikimedia.org/r/939741 (owner: 10Jbond) [16:07:47] arnoldokoth: eoghan saod hje [16:08:03] oops.. eoghan said he'd look at it too. [16:08:34] (03PS3) 10Jbond: dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 [16:09:22] dancy: no problem. 
[16:09:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42585/console" [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond) [16:11:29] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42586/console" [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy) [16:14:21] (03PS2) 10Jbond: sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 [16:14:23] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42587/console" [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy) [16:14:31] (03CR) 10Jbond: [C: 04-1] "We first need to update the microservice" [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond) [16:14:54] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:15:27] 10SRE, 10LDAP-Access-Requests: Request for Turnilo Access - https://phabricator.wikimedia.org/T342132 (10Mpossoupe) Tagging @BelindaMbambo as my manager for approval [16:16:59] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for e4 mgmt entries - cmooney@cumin1001" [16:17:23] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42588/console" [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy) [16:17:27] (03CR) 10CI reject: [V: 04-1] sre.hosts.reimage: connect to the micro service port [cookbooks] - 10https://gerrit.wikimedia.org/r/939738 (owner: 10Jbond) [16:17:29] 10SRE, 10LDAP-Access-Requests: Request for Turnilo 
Access - https://phabricator.wikimedia.org/T342132 (10BelindaMbambo) Dear @andrea.denisse I am approving @Mpossoupe for the above request for tunilo , thank you [16:18:42] (03CR) 10Gergő Tisza: [C: 03+1] IP Masking: Enable for cswiki beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/938915 (https://phabricator.wikimedia.org/T342034) (owner: 10Urbanecm) [16:19:08] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] Fix buildkitd.toml.erb [puppet] - 10https://gerrit.wikimedia.org/r/939730 (https://phabricator.wikimedia.org/T329220) (owner: 10Ahmon Dancy) [16:19:32] (03PS4) 10Ssingh: dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond) [16:20:22] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for e4 mgmt entries - cmooney@cumin1001" [16:20:22] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:26] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [16:20:35] (03CR) 10Jforrester: wikifunctions: Attempt to write out our main config as JSON (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/938300 (owner: 10Jforrester) [16:20:37] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42589/console" [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond) [16:20:45] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:21:37] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[16:25:39] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for e4 mgmt entries - cmooney@cumin1001" [16:26:24] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for e4 mgmt entries - cmooney@cumin1001" [16:26:24] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:27:34] PROBLEM - Check systemd state on db1108 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@matomo.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:52] (03CR) 10Ssingh: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/939742/42593/ NOOP on all DNS hosts" [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond) [16:29:56] !log joal@deploy1002 Started deploy [airflow-dags/analytics@4c06501]: Fix bug introduced in cassandra loading jobs [16:30:08] (03CR) 10Ssingh: [V: 03+1] "Thanks for the patch! I am going to merge it as per your permission." 
[puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond) [16:30:11] !log joal@deploy1002 Finished deploy [airflow-dags/analytics@4c06501]: Fix bug introduced in cassandra loading jobs (duration: 00m 15s) [16:31:04] (03PS1) 10Ilias Sarantopoulos: ml-services: inference chart change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) [16:31:46] (03PS2) 10Ilias Sarantopoulos: ml-services: inference chart change .wiki to reflect wikiID [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) [16:31:49] (03CR) 10Ayounsi: [C: 03+1] icinga_exporter: team-tag netops icinga alerts [puppet] - 10https://gerrit.wikimedia.org/r/939695 (owner: 10Filippo Giunchedi) [16:33:01] (03CR) 10Ilias Sarantopoulos: ml-services: inference chart change .wiki to reflect wikiID (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/939744 (https://phabricator.wikimedia.org/T342266) (owner: 10Ilias Sarantopoulos) [16:33:53] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dns::recursor: filter out undef value [puppet] - 10https://gerrit.wikimedia.org/r/939742 (owner: 10Jbond) [16:36:46] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10DMburugu) I approve the request. [16:37:58] (03PS1) 10Elukey: knative-serving: set a more lenient readiness probe for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/939745 [16:39:01] (03CR) 10Klausman: [V: 03+1 C: 03+1] knative-serving: set a more lenient readiness probe for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/939745 (owner: 10Elukey) [16:40:42] (03CR) 10Elukey: [C: 03+2] knative-serving: set a more lenient readiness probe for the webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/939745 (owner: 10Elukey) [16:43:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[16:44:06] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Aklapper) @Cyndymediawiksim I am sorry that I was unclear. "MediaWiki OAuth1 Account" on https://phabricator.wikimedia.org/settings/panel/external/ should ideally link your MediaWiki work acco... [16:44:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:46:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:47:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:50:03] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:53:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [16:55:03] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH configmaps) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:55:21] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1009.eqiad.wmnet [16:56:43] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [16:56:46] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [16:57:29] (03CR) 10BCornwall: [C: 03+2] dns: Update the examples docstring to updated name [cookbooks] - 10https://gerrit.wikimedia.org/r/939736 (owner: 10BCornwall) [16:57:49] (03CR) 10Jforrester: [C: 03+1] wikifunctions: Update orchestrator and evaluator [deployment-charts] - 
10https://gerrit.wikimedia.org/r/939686 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [16:57:53] (03CR) 10Jforrester: [C: 03+1] wikifunctions: Enable mesh and ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/939687 (https://phabricator.wikimedia.org/T297314) (owner: 10JMeybohm) [16:58:24] (03CR) 10BCornwall: [V: 03+1 C: 03+2] Remove custom Puppet disable on WDNS reboot (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939381 (https://phabricator.wikimedia.org/T342182) (owner: 10BCornwall) [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1700) [17:00:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1009.eqiad.wmnet [17:02:09] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [17:09:20] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and A:wikidough [17:09:42] BGP alerts expected in all sites [17:12:07] (03CR) 10Ssingh: [C: 03+2] dns5004: temporarily remove from authdns_servers for restart [puppet] - 10https://gerrit.wikimedia.org/r/939734 (owner: 10Ssingh) [17:15:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host dns5004.wikimedia.org [17:15:44] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:15:50] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:16:29] 10SRE, 10ops-eqiad, 10Data-Platform-SRE, 10Infrastructure-Foundations: analytics1073 and analytics1075 - loss of connectivity - https://phabricator.wikimedia.org/T342141 (10BTullis) We've been investigating this extensively and discussing in some depth on #wikimedia-dcops on IRC. We've decided to go ahead... 
[17:16:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1010.eqiad.wmnet [17:17:42] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ayounsi) FYI this Netbox report is alerting: https://netbox.wikimedia.org/extras/reports/results/4808787/#test_port_block_consistency ` xe-0/0/41 [eqiad] Interface type '10gbase-x-sfpp' does n... [17:19:26] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dns5004.wikimedia.org [17:20:16] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10cmooney) Thanks @ayounsi @RobH you can probably connect them to 44 and 45 instead. [17:20:46] PROBLEM - Bird Internet Routing Daemon on dns5004 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:22:24] PROBLEM - Check systemd state on dns5004 is CRITICAL: CRITICAL - degraded: The following units failed: anycast-healthchecker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:22:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1010.eqiad.wmnet [17:23:01] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1011.eqiad.wmnet [17:23:30] PROBLEM - Check if anycast-healthchecker and all configured threads are running on dns5004 is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:25:15] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10RobH) [17:27:01] (03PS1) 10Ssingh: Revert "dns5004: temporarily remove from authdns_servers for restart" [puppet] - 
10https://gerrit.wikimedia.org/r/939343 [17:27:40] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:27:46] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:27:56] (03CR) 10Ssingh: [C: 03+2] Revert "dns5004: temporarily remove from authdns_servers for restart" [puppet] - 10https://gerrit.wikimedia.org/r/939343 (owner: 10Ssingh) [17:27:58] RECOVERY - Check if anycast-healthchecker and all configured threads are running on dns5004 is OK: OK: UP (pid=4539) and all threads (2) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running [17:28:14] RECOVERY - Bird Internet Routing Daemon on dns5004 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [17:28:24] RECOVERY - Check systemd state on dns5004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:29:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1011.eqiad.wmnet [17:32:47] !log dummy run of authdns-update [17:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:42] (03PS1) 10Eevans: deployment-prep: Upgrade restbase04 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/939750 (https://phabricator.wikimedia.org/T313814) [17:38:08] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:38:22] ^ expected [17:38:50] (03CR) 10Eevans: [C: 03+2] deployment-prep: Upgrade restbase04 to Cassandra 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/939750 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [17:38:59] PROBLEM - 
Host db1218 #page is DOWN: PING CRITICAL - Packet loss = 100% [17:39:27] er [17:39:47] I am depooling [17:39:57] thanks [17:40:09] !log depool db1218 [17:40:10] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:20] !log sukhe@cumin2002 dbctl commit (dc=all): 'Depool db1218', diff saved to https://phabricator.wikimedia.org/P49603 and previous config saved to /var/cache/conftool/dbconfig/20230719-174019-sukhe.json [17:40:38] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Cyndymediawiksim - https://phabricator.wikimedia.org/T342230 (10Cyndymediawiksim) >>! In T342230#9028928, @Aklapper wrote: > @Cyndymediawiksim I am sorry that I was unclear. "MediaWiki OAuth1 Account" on https://phabricator.wikimedia.org/settings/panel/ext... [17:41:02] herron: done [17:41:11] ack, thank you sukhe [17:41:12] !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host lvs1013.eqiad.wmnet with OS bullseye [17:41:23] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin1001 for host lvs1013.eqiad.wmnet with OS bullseye [17:41:24] not sure who to make aware of this [17:41:28] 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from knams to esams - https://phabricator.wikimedia.org/T342198 (10wiki_willy) Cool, thanks for confirming @Papaul. Hopefully Iron Mountain will come back with the same confirmation as well. [17:41:40] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:42:09] sukhe: Amir1 would be a DBA around [17:42:23] sukhe: what's up? 
[17:42:30] Amir1: [17:42:30] 13:38:58 <+icinga-wm> PROBLEM - Host db1218 #page is DOWN: PING CRITICAL - Packet loss = 100% [17:42:33] depooled it [17:42:37] sigh, thanks [17:42:51] That's candidate master for s1 Amir1 [17:43:05] yes, I'm aware [17:43:26] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) >>! In T341992#9026925, @Vgutierrez wrote: > @RobH I'm seeing on cumin1001 logs, that you interrupted the reimage of lvs1013 by pressing Ctrl+C: > ` > 2023-07-18 16:01:28,549 robh 203485... [17:44:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1218.eqiad.wmnet with reason: Maint [17:44:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1218.eqiad.wmnet with reason: Maint [17:45:01] I can't ssh into it, gonna do a powercycle [17:45:03] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) a:03RobH >>! In T341992#9029076, @ayounsi wrote: > FYI this Netbox report is alerting: > https://netbox.wikimedia.org/extras/reports/results/4808787/#test_port_block_consistency > ` >... [17:45:40] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:44] (03PS1) 10Ayounsi: Add includes for lsw1-e8-eqiad v6 PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752 [17:46:02] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) Ok, the Bullseye OS has issues with the drivers for some of the hardware... Considering these are R430s, I don't think it is worth putting in time to install support for them in Bullsey... 
[17:46:39] (03CR) 10CI reject: [V: 04-1] Add includes for lsw1-e8-eqiad v6 PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752 (owner: 10Ayounsi) [17:49:43] !log powercycled db1218 (T342284) [17:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:47] T342284: db1218 crashed - https://phabricator.wikimedia.org/T342284 [17:50:54] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:50:57] !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host lvs1013.eqiad.wmnet with OS bullseye [17:51:12] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10ayounsi) @RobH they will need to have their switch port moved. On QFX5120s, if one port is configured at 1G, the 3 other adjacent ports can only be 1G. Here port 40 and port 42 are configure... [17:51:46] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:53:17] RECOVERY - Host db1218 #page is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [17:53:44] 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from esams to knams - https://phabricator.wikimedia.org/T342198 (10Papaul) [17:53:50] (03PS2) 10Ayounsi: Add includes for lsw1-e8-eqiad PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752 [17:53:56] up now [17:54:06] good ol' power cycle [17:54:10] 10SRE, 10ops-knams, 10DC-Ops: Relocate one of the mx480 from esams to knams - https://phabricator.wikimedia.org/T342198 (10Papaul) @wiki_willy i think i made a mistake that i just fixed that confirmation is from esams not from knams. thanks [17:54:20] racadm logs say something? [17:55:02] (03CR) 10Cathal Mooney: "LGTM!" 
[dns] - 10https://gerrit.wikimedia.org/r/939752 (owner: 10Ayounsi) [17:55:18] (03CR) 10Cathal Mooney: [C: 03+1] Add includes for lsw1-e8-eqiad PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752 (owner: 10Ayounsi) [17:55:19] (03CR) 10Ayounsi: [C: 03+2] Add includes for lsw1-e8-eqiad PTR records [dns] - 10https://gerrit.wikimedia.org/r/939752 (owner: 10Ayounsi) [17:56:54] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:57:48] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:58:10] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [17:59:46] sukhe: ^ 😬 [18:00:04] dancy and dduvall: Dear deployers, time to do the Train log triage with CPT deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1800). [18:00:04] dancy and dduvall: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T1800). [18:00:08] XioNoX: do you mean the BGP alerts or the DNS one! [18:00:34] sukhe: the cookbook I forgot [18:00:52] ah ha [18:01:52] (03CR) 10BCornwall: "The naming convention would make this file roll-restart-reboot-durum.py. 
Not a good naming convention for sure, but that's what the other " [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [18:04:17] (03PS3) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 [18:06:46] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:07:07] (03CR) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [18:07:54] (03CR) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [18:08:14] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:11:56] 10SRE, 10ops-eqiad, 10Traffic: Relocate lvs1013-lvs1016 to rows E & F - https://phabricator.wikimedia.org/T341992 (10RobH) >>! In T341992#9029218, @ayounsi wrote: > @RobH they will need to have their switch port moved. > > On QFX5120s, if one port is configured at 1G, the 3 other adjacent ports can only be... [18:13:38] It's train time! 
[18:14:31] (03PS1) 10TrainBranchBot: group1 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939753 (https://phabricator.wikimedia.org/T340246) [18:14:33] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939753 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [18:15:52] (03Merged) 10jenkins-bot: group1 wikis to 1.41.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939753 (https://phabricator.wikimedia.org/T340246) (owner: 10TrainBranchBot) [18:22:08] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv6: Connect - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:22:28] (03PS2) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) [18:22:51] (03CR) 10CI reject: [V: 04-1] Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) (owner: 10ArielGlenn) [18:23:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:23:36] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:23:46] (03PS1) 10Krinkle: Profiler: Remove "toobig" filter from Arc Lamp ingestion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939755 (https://phabricator.wikimedia.org/T337873) [18:23:48] (03PS1) 10Krinkle: Profiler: Sync minor changes with arc-lamp.git package [mediawiki-config] - 10https://gerrit.wikimedia.org/r/939756 
(https://phabricator.wikimedia.org/T337873) [18:24:00] Does anyone know what's up with parse1002? [18:24:22] dancy: https://sal.toolforge.org/production?p=0&q=parse1002&d= [18:24:37] looks like it had an issue this morning [18:24:41] what are you seeing now? [18:24:57] ssh connections timing out. [18:25:03] (during train deployment) [18:25:23] Unresponsive to ping [18:25:34] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.41.0-wmf.18 refs T340246 [18:25:38] T340246: 1.41.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T340246 [18:28:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:29:32] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and A:wikidough [18:31:06] (03CR) 10BCornwall: [V: 03+1 C: 03+1] "Looks great, and runs as expected!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [18:32:40] (03CR) 10BCornwall: [V: 03+1 C: 03+1] sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [18:33:13] (03CR) 10Ssingh: sre.dns: add a new cookbook for durum reboot/service restarts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [18:36:39] (03CR) 10Ssingh: [C: 03+2] sre.dns: add a new cookbook for durum reboot/service restarts [cookbooks] - 10https://gerrit.wikimedia.org/r/939704 (owner: 10Ssingh) [18:41:18] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and A:durum [18:43:40] (03PS1) 10Jforrester: Remove wikifunctions.org Varnish 302 [puppet] - 10https://gerrit.wikimedia.org/r/939757 (https://phabricator.wikimedia.org/T275945) [18:45:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) [18:45:51] (03PS3) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) [18:46:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Dwisehaupt) 05Open→03Resolved [18:49:16] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:49:22] PROBLEM - BFD status on cr2-codfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:49:32] expected BGP alerts in all sites [18:49:44] durum restart, I am monitoring in case something else comes up [18:50:48] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 111, 
down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:50:54] RECOVERY - BFD status on cr2-codfw is OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:54:00] (03CR) 10JHathaway: "Would it be worth considering a systemd adhoc timer that would trigger some time after the puppet run is complete? e.g. systemd-run --on-a" [puppet] - 10https://gerrit.wikimedia.org/r/939643 (https://phabricator.wikimedia.org/T330490) (owner: 10Jbond) [18:57:15] 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH) [18:58:00] 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (10RobH) [18:58:51] !log sukhe@cumin2002 START - Cookbook sre.hosts.reboot-single for host pybal-test2003.codfw.wmnet [19:03:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pybal-test2003.codfw.wmnet [19:04:48] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:06:16] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:15:38] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:15:50] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:17:06] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:17:18] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:21:36] PROBLEM - BFD status on cr2-eqsin is 
CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:21:48] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:23:06] RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:23:18] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:24:53] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk1003.eqiad.wmnet [19:26:02] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:27:58] PROBLEM - BFD status on cr2-esams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:28:28] PROBLEM - BFD status on cr3-esams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:28:55] !log bking@cumin1001 START - Cookbook sre.dns.netbox [19:29:28] RECOVERY - BFD status on cr2-esams is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:29:58] RECOVERY - BFD status on cr3-esams is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:32:00] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:32:40] (03PS4) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) [19:33:56] PROBLEM - BFD status on cr2-esams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:34:26] 
PROBLEM - BFD status on cr3-esams is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:34:32] wish there was a way to silence these alerts [19:34:35] but alas [19:34:50] in some ways silencing them is not desirable [19:35:25] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [19:35:26] RECOVERY - BFD status on cr2-esams is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:35:56] RECOVERY - BFD status on cr3-esams is OK: UP: 17 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:36:40] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:36:43] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [19:36:43] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:36:44] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk1003.eqiad.wmnet [19:36:49] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk1003.eqiad.wmnet` - flink-zk1003.eqiad.wmnet... 
[19:37:16] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:37:20] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [19:37:22] !log bking@cumin1001 START - Cookbook sre.dns.netbox [19:39:43] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [19:40:29] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [19:40:29] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:40:29] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors [19:40:32] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors [19:40:58] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [19:41:42] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:41:42] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [19:42:08] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm [19:42:16] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs 
requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm [19:42:18] (03PS3) 10Hubaishan: Replace underscores with spaces in 4 Arabic sitenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) [19:42:36] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:45:20] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and A:durum [19:47:40] that should be all the bgp alerts [19:50:42] (03PS5) 10ArielGlenn: Make sure that rsync runs only on the primary dumps nfs shares [puppet] - 10https://gerrit.wikimedia.org/r/939674 (https://phabricator.wikimedia.org/T325232) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T2000). [20:00:05] hubaishan and kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] hello [20:01:02] I can deploy! 
[20:01:10] kimberly_sarabia: I'll do yours first, as it's a beta-only change :) [20:01:20] ty [20:01:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia) [20:01:50] (03PS1) 10Eevans: cassandra: prevent malformed config when tls_cluster_name is unset [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814) [20:02:09] (03Merged) 10jenkins-bot: Turn off A/B Test in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/936333 (https://phabricator.wikimedia.org/T337956) (owner: 10Kimberly Sarabia) [20:03:29] kimberly_sarabia: done, it'll be live on beta in a few minutes :) [20:03:54] TheresNoTime: Ok! tysm [20:04:28] (03PS2) 10Eevans: cassandra: prevent malformed config when tls_cluster_name is unset [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814) [20:04:44] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/939763 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [20:05:39] Hello [20:06:16] (03CR) 10Samtar: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan) [20:06:19] hubaishan: hi! Just looking at your patch now :) have you done a backport before? [20:06:39] No [20:07:34] No problem :) first things first, have you read https://wikitech.wikimedia.org/wiki/Backport_windows#Doing_the_deploy and do you have https://wikitech.wikimedia.org/wiki/WikimediaDebug installed?
[20:08:54] (03PS4) 10Samtar: Replace underscores with spaces in 4 Arabic sitenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan) [20:09:41] https://wikitech.wikimedia.org/wiki/WikimediaDebug is installed [20:10:08] great, let's start :) I'll let you know when to test, and we can look at testing it together [20:10:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan) [20:10:55] (03Merged) 10jenkins-bot: Replace underscores with spaces in 4 Arabic sitenames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/927713 (https://phabricator.wikimedia.org/T337725) (owner: 10Hubaishan) [20:11:24] !log samtar@deploy1002 Started scap: Backport for [[gerrit:927713|Replace underscores with spaces in 4 Arabic sitenames (T337725)]] [20:11:27] T337725: Replace underscores with spaces in Arabic Wikimedia project sitenames - https://phabricator.wikimedia.org/T337725 [20:12:58] !log samtar@deploy1002 samtar and hubaishan: Backport for [[gerrit:927713|Replace underscores with spaces in 4 Arabic sitenames (T337725)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:13:57] hubaishan: okay, that change is live on mwdebug. You can use the WikimediaDebug extension to pick any of the `mwdebug` servers and test.
For example, I can see that https://ar.wikisource.org/wiki/%D9%85%D8%B3%D8%AA%D8%AE%D8%AF%D9%85:TheresNoTime/Test normally shows `ويكي_مصدر`, but when using mwdebug, it shows `ويكي مصدر` (after resaving the page) [20:14:59] (that's seeing the output of `{{SITENAME}}` by the way) [20:15:24] Once you're happy that your patch works as expected, let me know and we can sync it :) [20:16:01] It is Good :] [20:16:36] Awesome, syncing now [20:16:42] Once that [20:17:03] Once that's done, which can take a while, I'll let you know and you can test again without using `mwdebug` :) [20:22:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:22:38] Noting that I've had 1 failure during `sync-apaches`, logged at https://phabricator.wikimedia.org/P49604 [20:22:39] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [20:24:30] same host during `scap-cdb-rebuild`, — `parse1002.eqiad.wmnet` [20:24:55] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse dns for spine linknets eqiad - cmooney@cumin1001" [20:25:26] TheresNoTime: I think parse1002 needs a depool [20:25:32] It had issues during the train [20:26:29] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add reverse dns for spine linknets eqiad - cmooney@cumin1001" [20:26:29] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:27:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:28:33] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:927713|Replace underscores with spaces in 4 Arabic sitenames (T337725)]] (duration: 17m 09s) [20:28:36] T337725: Replace underscores with spaces in Arabic Wikimedia project sitenames - https://phabricator.wikimedia.org/T337725 [20:28:37] hubaishan: okay, can you test again, but this time make sure the `mwdebug` toggle is switched off :) [20:29:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install titan100[12] - https://phabricator.wikimedia.org/T342179 (10RobH) [20:31:35] !log backport window closed [20:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:21] TheresNoTime it is OK. [20:33:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1003.eqiad.wmnet with OS bookworm [20:33:37] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [20:33:39] great, all done then! :) Thank you for the patch [20:33:43] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w... 
[20:34:14] 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10RobH) [20:34:22] 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install titan200[12] - https://phabricator.wikimedia.org/T342300 (10RobH) [20:38:01] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [20:38:02] !log bking@cumin1001 START - Cookbook sre.dns.netbox [20:39:03] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Cabling for Eqiad racke E5-7 and F5-7 - https://phabricator.wikimedia.org/T334231 (10cmooney) @Jclark-ctr my apologies for some reason I thought these links had been cabled but seems from T338789 I didn't update the optic type so we need got them... [20:39:24] !log bking@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [20:39:26] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [20:39:34] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts flink-zk1003.eqiad.wmnet [20:43:29] !log bking@cumin1001 START - Cookbook sre.dns.netbox [20:54:07] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [20:55:00] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: flink-zk1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin1001" [20:55:00] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:55:01] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts flink-zk1003.eqiad.wmnet [20:55:08] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - 
https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bking@cumin1001 for hosts: `flink-zk1003.eqiad.wmnet` - flink-zk1003.eqiad.wmnet... [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230719T2100) [21:20:01] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [21:21:49] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 01m 47s) [21:26:21] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [21:27:26] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 01m 05s) [21:32:55] !log bking@cumin1001 START - Cookbook sre.ganeti.makevm for new host flink-zk1003.eqiad.wmnet [21:32:56] !log bking@cumin1001 START - Cookbook sre.dns.netbox [21:36:43] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [21:37:29] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [21:37:29] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:37:29] !log bking@cumin1001 START - Cookbook sre.dns.wipe-cache flink-zk1003.eqiad.wmnet on all recursors [21:37:32] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) flink-zk1003.eqiad.wmnet on all recursors [21:37:59] !log bking@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM flink-zk1003.eqiad.wmnet - bking@cumin1001" [21:38:43] !log bking@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM 
flink-zk1003.eqiad.wmnet - bking@cumin1001" [21:41:00] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host flink-zk1003.eqiad.wmnet with OS bookworm [21:41:08] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm [21:42:00] 10SRE, 10Traffic, 10cloud-services-team: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (10BCornwall) I would think that this needs to be followed since it's technically a new service even it's a rename. For instance, the dns repo still has "labweb" in templates/wmnet. A... [21:44:40] 10SRE, 10Traffic, 10cloud-services-team: Rename references of labweb to cloudweb - https://phabricator.wikimedia.org/T317463 (10taavi) Yeah, the above patches were just getting rid of the non-TLS endpoint so we have one service to rename instead of two. The actual rename still needs to be done. 
[22:08:55] 10SRE, 10SRE-Access-Requests: Requesting access to ops for taavi - https://phabricator.wikimedia.org/T342307 (10taavi) [22:10:08] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 61, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:10:46] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 83, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:36:26] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host flink-zk1003.eqiad.wmnet with OS bookworm [22:36:27] !log bking@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host flink-zk1003.eqiad.wmnet [22:36:33] 10SRE, 10Data-Platform-SRE, 10vm-requests, 10Discovery-Search (Current work): eqiad: 3 VMs requested for Zookeeper - https://phabricator.wikimedia.org/T341705 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host flink-zk1003.eqiad.wmnet with OS bookworm executed w... [23:57:06] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:57:50] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down