[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:28] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (Andrew) @Jclark-ctr it'll be another week or two before we have workloads moved off of this.
[02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:39] (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:10:09] (PS2) KartikMistry: Update cxserver to 2023-03-17-133444-production [deployment-charts] - https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379)
[05:13:01] (PS3) Marostegui: mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)
[05:14:21] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510
[05:14:27] T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
[05:14:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510
[05:16:56] Updating cxserver, minor changes.
[05:18:10] (CR) KartikMistry: [C: +2] Update cxserver to 2023-03-17-133444-production [deployment-charts] - https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) (owner: KartikMistry)
[05:19:04] (PS1) Marostegui: db1179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/902902 (https://phabricator.wikimedia.org/T332292)
[05:19:35] (PS15) KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[05:19:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T332292', diff saved to https://phabricator.wikimedia.org/P45942 and previous config saved to /var/cache/conftool/dbconfig/20230327-051941-root.json
[05:19:46] T332292: Move db1179 to x1 - https://phabricator.wikimedia.org/T332292
[05:19:53] (CR) Marostegui: [C: +2] db1179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/902902 (https://phabricator.wikimedia.org/T332292) (owner: Marostegui)
[05:22:56] (Merged) jenkins-bot: Update cxserver to 2023-03-17-133444-production [deployment-charts] - https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) (owner: KartikMistry)
[05:23:42] (PS1) Marostegui: mariadb: Move db1179 to x1 [puppet] - https://gerrit.wikimedia.org/r/902903 (https://phabricator.wikimedia.org/T332292)
[05:23:47] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:24:14] (CR) Marostegui: [C: +2] mariadb: Move db1179 to x1 [puppet] - https://gerrit.wikimedia.org/r/902903 (https://phabricator.wikimedia.org/T332292) (owner: Marostegui)
[05:24:27] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:28:00] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:28:52] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:37:57] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:38:42] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:40:49] !log Updated cxserver to 2023-03-17-133444-production (T332379 + build changes)
[05:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:54] T332379: Post-creation work for anpwiki - https://phabricator.wikimedia.org/T332379
[05:57:34] (PS1) KartikMistry: Enable Section Translation on some wikis while Content Translation remains in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834)
[06:19:47] (CR) Krinkle: Fix PHP string interpolation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: Reedy)
[06:29:39] (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:36:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45944 and previous config saved to /var/cache/conftool/dbconfig/20230327-063642-root.json
[06:40:20] !log Rename flaggedrevs tables on db1123 ptwikisource T332594
[06:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:25] T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
[06:51:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45945 and previous config saved to /var/cache/conftool/dbconfig/20230327-065147-root.json
[06:51:53] !log dbmaint s3 eqiad Rename flaggedrevs tables on db1123 ptwikisource T332594
[06:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:58] T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
[06:54:22] (PS1) Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)
[07:00:05] Amir1, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T0700).
[07:00:05] No Gerrit patches in the queue for this window AFAICS.
[07:06:49] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:06:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45946 and previous config saved to /var/cache/conftool/dbconfig/20230327-070651-root.json
[07:07:57] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:15] (PS1) Marostegui: backups: Replace db1164 with db1101 [puppet] - https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510)
[07:12:08] jynus: also that one ^ :)
[07:12:19] oh
[07:12:29] I forgot
[07:12:46] needs 2 changes actually
[07:12:59] ah yes
[07:13:00] I see it
[07:13:02] let me fix it
[07:13:27] (PS2) Marostegui: backups: Replace db1164 with db1101 [puppet] - https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510)
[07:13:29] jynus: ^
[07:13:41] (PS4) Marostegui: mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)
[07:13:47] (CR) Jcrespo: [C: +1] backups: Replace db1164 with db1101 [puppet] - https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:14:06] one sec because I was looking and there are backups still running
[07:14:27] sure no problem
[07:21:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45947 and previous config saved to /var/cache/conftool/dbconfig/20230327-072156-root.json
[07:30:28] (CR) Jcrespo: [C: +1] mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:32:29] (CR) Marostegui: [C: +2] mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:33:11] SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[07:34:11] * urbanecm goes to do some MW deployment, since B&C is empty
[07:34:16] (CR) Urbanecm: [C: +2] SpecialWikiSets: Avoid calling WikiSet::getId on null [extensions/CentralAuth] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902741 (https://phabricator.wikimedia.org/T333075) (owner: Urbanecm)
[07:34:32] (CR) Urbanecm: [C: +2] GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) (owner: Urbanecm)
[07:36:40] (Merged) jenkins-bot: SpecialWikiSets: Avoid calling WikiSet::getId on null [extensions/CentralAuth] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902741 (https://phabricator.wikimedia.org/T333075) (owner: Urbanecm)
[07:37:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45948 and previous config saved to /var/cache/conftool/dbconfig/20230327-073701-root.json
[07:38:50] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]]
[07:38:58] T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
[07:39:50] !log disabling puppet and shutting down bacula at backup1001 T331510
[07:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:55] T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
[07:41:52] a prometheus availability job will alert because of the above log, as the job only monitors that 1 host
[07:44:25] (PS1) Jcrespo: bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs [puppet] - https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896)
[07:46:45] (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:48:21] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:48:26] T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
[07:48:39] SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (ayounsi) Open→Resolved a:ayounsi Done!
[07:51:39] (CR) Jcrespo: [C: +2] bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs [puppet] - https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896) (owner: Jcrespo)
[07:52:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45949 and previous config saved to /var/cache/conftool/dbconfig/20230327-075206-root.json
[07:52:55] (Merged) jenkins-bot: GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) (owner: Urbanecm)
[07:55:13] RECOVERY - PHP7 rendering on parse2017 is OK: HTTP OK: HTTP/1.1 302 Found - 519 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:55:36] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]] (duration: 16m 45s)
[07:55:41] T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
[07:58:36] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]]
[07:58:41] T331444: MediaWiki:GrowthMentors.json: Add a write-only username field - https://phabricator.wikimedia.org/T331444
[07:59:58] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:00:57] RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:12] (CR) Tacsipacsi: [huwiki] Add Draft and Draft_talk namespaces (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: Superpes15)
[08:02:04] (PS1) Ladsgroup: EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941)
[08:02:27] (CR) Ladsgroup: [C: +2] EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941) (owner: Ladsgroup)
[08:02:53] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40331/console" [puppet] - https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[08:03:43] !log Failover m1 from db1164 to db1101 - T331510
[08:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:49] T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
[08:03:54] Amir1: fyi my scap backport's just about to finish
[08:04:14] mine takes twenty minutes to merge, don't worry
[08:04:14] all done jynus
[08:04:21] ok
[08:04:28] ok to merge the backup patches?
[08:04:47] (CR) Marostegui: [C: +2] backups: Replace db1164 with db1101 [puppet] - https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[08:05:02] Etherpad looks fie
[08:05:03] fine
[08:05:35] it is a bit slow for me
[08:05:53] I guess it's warming up
[08:06:04] I can open the test pad fine
[08:06:29] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]] (duration: 07m 52s)
[08:06:34] T331444: MediaWiki:GrowthMentors.json: Add a write-only username field - https://phabricator.wikimedia.org/T331444
[08:06:51] it is ok for me now
[08:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:07:11] what else to test?
[08:07:30] jynus: librenms, which also works fine for me
[08:07:31] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:07:48] orch is complaining about lag, I guess not real?
[08:07:53] reload :)
[08:08:18] still happening
[08:08:29] * urbanecm done
[08:08:31] ah I know why
[08:09:14] cleanup of the table maybe?
[08:09:32] (PS1) Marostegui: db1101: Make it master [puppet] - https://gerrit.wikimedia.org/r/903181 (https://phabricator.wikimedia.org/T331510)
[08:09:33] jynus: nope, this ^
[08:09:37] I see
[08:09:44] (CR) Marostegui: [V: +2 C: +2] db1101: Make it master [puppet] - https://gerrit.wikimedia.org/r/903181 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[08:10:40] jynus: fixed!
[08:11:25] looking at the original path to see why I didn't see that
[08:11:29] *patch
[08:11:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:12:50] let me run puppet on backup hosts
[08:12:52] (CR) Jelto: [V: +1 C: +1] "lgtm, left one little question in-line" [puppet] - https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[08:12:53] to apply the change
[08:16:40] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[08:17:27] (CR) Ladsgroup: mediawiki: Reduce the frequency of flaggedrevs updates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: Ladsgroup)
[08:17:29] (PS1) Marostegui: Revert "backups: Replace db1164 with db1101" [puppet] - https://gerrit.wikimedia.org/r/903188
[08:17:32] * urbanecm rollouts one more change
[08:17:35] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) (owner: Urbanecm)
[08:17:39] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/903188 (owner: Marostegui)
[08:17:59] (Merged) jenkins-bot: EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941) (owner: Ladsgroup)
[08:18:18] (Merged) jenkins-bot: [Growth] eswiki: Enable mentorship for 50% of newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) (owner: Urbanecm)
[08:18:25] (PS2) Filippo Giunchedi: prometheus1006: depool from alertmanager [puppet] - https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165)
[08:18:33] !log urbanecm@deploy2002 Backport cancelled.
[08:19:45] (PS1) Marostegui: mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123)
[08:19:58] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: Marostegui)
[08:20:54] (CR) Jcrespo: [C: +1] Revert "backups: Replace db1164 with db1101" [puppet] - https://gerrit.wikimedia.org/r/903188 (owner: Marostegui)
[08:20:59] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]]
[08:21:01] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:21:05] T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
[08:23:23] (PS1) Filippo Giunchedi: wmnet: move reads to graphite2004 [dns] - https://gerrit.wikimedia.org/r/903185 (https://phabricator.wikimedia.org/T330165)
[08:24:59] (PS1) Filippo Giunchedi: graphite: check graphite2004 [puppet] - https://gerrit.wikimedia.org/r/903206 (https://phabricator.wikimedia.org/T330165)
[08:25:23] (CR) Marostegui: [C: -2] mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: Marostegui)
[08:25:47] !log urbanecm@deploy2002 Synchronized wmf-config/InitialiseSettings.php: 63dd23b5ceaba35c8d9682493dd21d99a20fc8f7: [Growth] eswiki: Enable mentorship for 50% of newcomers (T332737, T285235) (duration: 06m 09s)
[08:25:54] T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235
[08:25:54] T332737: Increase percentage of newcomers who receive Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T332737
[08:26:40] (PS1) Filippo Giunchedi: statsd: move writes to graphite2004 [puppet] - https://gerrit.wikimedia.org/r/903207 (https://phabricator.wikimedia.org/T330165)
[08:26:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:28:10] (PS1) Filippo Giunchedi: wmnet: move writes to graphite2004 [dns] - https://gerrit.wikimedia.org/r/903208 (https://phabricator.wikimedia.org/T330165)
[08:28:14] !log restarting bacula at backup1001 T331510
[08:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:20] T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
[08:30:09] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (Ladsgroup)
[08:30:29] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:30:34] T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
[08:31:25] (PS1) Filippo Giunchedi: Failover statsd to graphite2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/903209 (https://phabricator.wikimedia.org/T330165)
[08:31:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:32:29] (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40332/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[08:32:31] (CR) JMeybohm: [C: +1] "SGTM" [puppet] - https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[08:34:48] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:35:43] (CR) Elukey: [C: +1] k8s: Force to be explicit about k8s and calico versions [puppet] - https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[08:36:04] (CR) Elukey: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[08:36:11] (PS2) Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)
[08:36:44] (CR) Jelto: [V: +1 C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[08:36:45] (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:38:42] (CR) CI reject: [V: -1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[08:39:15] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]] (duration: 18m 15s)
[08:39:22] T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
[08:40:24] (CR) Elukey: [C: +1] k8s: Remove 1.16 related code (1 comment) [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[08:40:53] (CR) Elukey: [C: +2] role::kafka::jumbo::broker: enable PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[08:43:54] SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (fgiunchedi)
[08:45:18] (PS2) Hashar: wm-zuul-status: filter out non-live item [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/902705 (https://phabricator.wikimedia.org/T214068)
[08:46:46] (CR) Clément Goubert: [C: +1] prometheus1006: depool from alertmanager [puppet] - https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165) (owner: Filippo Giunchedi)
[08:47:02] !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[08:50:14] (CR) Filippo Giunchedi: [C: +2] prometheus1006: depool from alertmanager [puppet] - https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165) (owner: Filippo Giunchedi)
[08:51:17] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[08:52:14] hah that was me, false alarm
[08:52:37] prometheus1005 was also depooled, I've repooled it now
[08:53:39] (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40333/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: Dduvall)
[08:55:02] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[08:56:17] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[08:57:02] SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[08:57:19] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for mw-api-int - cgoubert@cumin1001"
[08:58:24] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for mw-api-int - cgoubert@cumin1001"
[08:58:24] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:00:45] (PS1) Clément Goubert: mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120)
[09:02:18] (CR) Jelto: [V: +1] "looks mostly good, one question in-line" [puppet] - https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: Dduvall)
[09:02:52] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:03:06] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:03:10] (CR) Giuseppe Lavagetto: [C: +1] mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:03:33] (CR) Clément Goubert: [C: +2] mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:04:08] SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Aklapper)
[09:06:20] (PS1) Clément Goubert: Revert "mw-api-int: Add records" [dns] - https://gerrit.wikimedia.org/r/903190
[09:08:25] (CR) Clément Goubert: [C: +2] Revert "mw-api-int: Add records" [dns] - https://gerrit.wikimedia.org/r/903190 (owner: Clément Goubert)
[09:12:17] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[09:12:59] (PS1) Clément Goubert: mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120)
[09:13:20] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) I pondered multiple options for the Netbox `server_bgp` custom field, feedback from ServiceOps welcome ba...
[09:15:23] (PS2) Clément Goubert: mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120)
[09:17:17] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[09:17:18] (CR) Giuseppe Lavagetto: [C: +1] mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:17:20] (CR) Thiemo Kreuz (WMDE): [C: +1] mediawiki: Reduce the frequency of flaggedrevs updates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: Ladsgroup)
[09:18:04] (CR) Clément Goubert: [C: +2] mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:24:56] (PS3) Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)
[09:25:43] (CR) Clément Goubert: [V: +1] "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:27:00] (CR) CI reject: [V: -1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[09:33:54] (PS4) Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)
[09:39:55] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[09:40:10] (CR) Giuseppe Lavagetto: [C: +1] "LGTM, optional nits inline." [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:40:59] (PS1) Jbond: Offboard nfraison [puppet] - https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135)
[09:41:05] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:43:37] (PS2) Jbond: Offboard nfraison [puppet] - https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135)
[09:44:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45295
[09:45:25] (CR) Jbond: [C: +2] Offboard nfraison [puppet] - https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135) (owner: Jbond)
[09:45:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45295
[09:46:49] (CR) Clément Goubert: [V: +1] service_catalog: Add mw-api-int k8s service (2 comments) [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:47:07] (PS2) Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)
[09:47:09] (CR) Effie Mouzeli: [C: +1] P:kubernetes::node: Use performance governor [puppet] - https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788) (owner: Clément Goubert)
[09:47:13] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main1003.eqiad.wmnet with reason: stop kafka and dist-upgrade
[09:47:26] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main1003.eqiad.wmnet with reason: stop kafka and dist-upgrade
[09:50:02] (PS3) Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)
[09:50:53] (CR) Clément Goubert: service_catalog: Add mw-api-int k8s service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:51:57] (CR) Clément Goubert: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40336/console" [puppet] - https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:54:32] (PS7) Filippo Giunchedi: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: Jbond)
[09:54:34] (PS3) Filippo Giunchedi: team-sre/puppet-agent: Add widespread puppet failure (no resources) alert [alerts] - https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: Jbond)
[09:57:46] (PS2) JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)
[09:58:10] (CR) Filippo Giunchedi: [C: +1] monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - https://gerrit.wikimedia.org/r/902791
(https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [09:59:29] (03CR) 10LSobanski: [C: 04-1] "The change has not been confirmed yet so let's not jump the gun on this." [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [09:59:32] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::errorpage: rationalize usage [puppet] - 10https://gerrit.wikimedia.org/r/902446 (owner: 10Giuseppe Lavagetto) [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1000) [10:02:10] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10hnowlan) [10:02:30] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ArielGlenn) [10:02:49] (03CR) 10Jelto: monitoring/alerting: globally replace serviceops-collab with sre-collab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [10:03:13] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10hnowlan) [10:03:49] (03PS3) 10JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) [10:03:51] !log depool ms-fe2009 [10:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:17] (03CR) 10Jbond: [C: 03+2] "thanks" [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [10:05:30] (03Merged) 10jenkins-bot: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) 
(owner: 10Jbond) [10:05:33] (03CR) 10Jbond: [C: 03+2] team-sre/puppet-agent: Add widespread puppet failure (no resources) alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [10:06:12] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [10:06:24] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 4 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium a:03Clement_Goubert [10:06:44] (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [10:06:54] (03CR) 10Jbond: [C: 03+2] team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [10:07:10] (03CR) 10Filippo Giunchedi: "Ben, does this look good to you? thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [10:08:32] (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM, thank you Daniel" [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [10:08:44] (03CR) 10Filippo Giunchedi: [C: 03+1] releases: remove Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [10:09:15] (03CR) 10Filippo Giunchedi: [C: 03+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [10:09:43] (03CR) 10Filippo Giunchedi: [C: 03+1] etherpad: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [10:10:09] !log dist-upgrade kafka-main1003 manually to bullseye - T332013 [10:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:15] T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013 [10:13:21] (03CR) 10Filippo Giunchedi: "LGTM modulo alert name" [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [10:15:18] (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [10:15:34] (03CR) 10EoghanGaffney: [C: 03+1] etherpad: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [10:17:04] (03CR) 10EoghanGaffney: [C: 03+1] releases: remove Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: 
10Dzahn) [10:17:52] (03CR) 10EoghanGaffney: [C: 03+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [10:20:11] (03PS1) 10Jbond: cinga: drop nfraison from ACL's [puppet] - 10https://gerrit.wikimedia.org/r/903221 (https://phabricator.wikimedia.org/T333135) [10:20:17] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - 10https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [10:21:16] (03CR) 10JMeybohm: k8s: Force docker storage-driver to overlay2 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [10:21:26] (03CR) 10Jbond: [C: 03+2] cinga: drop nfraison from ACL's [puppet] - 10https://gerrit.wikimedia.org/r/903221 (https://phabricator.wikimedia.org/T333135) (owner: 10Jbond) [10:21:30] PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:22:16] (03CR) 10JMeybohm: [C: 03+1] "PCC (expected to fail on alert) https://puppet-compiler.wmflabs.org/output/902318/40337/" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [10:22:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:22:30] (03CR) 10JMeybohm: [V: 03+1] k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 
(https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [10:24:26] (WidespreadPuppetFailure) firing: Puppet has failed on kafka_main cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:24:44] 10SRE, 10Infrastructure Security, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) [10:24:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:24:57] (03CR) 10EoghanGaffney: [C: 03+1] peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [10:25:31] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) [10:27:08] <_joe_> jouncebot: next [10:27:09] In 2 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1300) [10:27:15] <_joe_> jouncebot: now [10:27:15] For the next 0 hour(s) and 32 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1000) [10:27:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions 
[10:27:28] <_joe_> elukey: this sounds promising ^^ [10:27:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:28:00] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:28:02] yep all recovered :) [10:28:39] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:28:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [10:29:26] (WidespreadPuppetFailure) resolved: Puppet has failed on kafka_main cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:29:40] (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:30:55] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:31:20] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:32:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:33:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [10:34:39] (NodeTextfileStale) resolved: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:34:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:35:50] (03PS4) 10EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) [10:36:17] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:39:22] this is due to the roll restart --^ [10:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:41:17] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [10:41:17] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (10SLyngshede-WMF) We're missing a "dry_run" for services and puppet, but Puppet doesn't need is as the decorator also checks for _remote_hosts. [10:41:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [10:42:45] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) [10:43:18] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) @Dzahn can you take care of password store [10:44:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:45:17] (03PS4) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) [10:47:02] (03Merged) 10jenkins-bot: mesh.configuration: 
add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto) [10:48:05] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10BTullis) I will take care of the HBase/Hadoop permissions and any leftover files. [10:48:19] (03CR) 10EoghanGaffney: [C: 03+2] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [10:52:26] (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:54:14] (03CR) 10Superpes15: [huwiki] Add Draft and Draft_talk namespaces (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: 10Superpes15) [10:55:29] (03PS6) 10Superpes15: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) [10:55:37] (03PS7) 10Superpes15: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) [10:56:49] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:57:03] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:59:26] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (10SLyngshede-WMF) PuppetMaster Class needs dry_run, this can be done by letting the class inherit from RemoteHostsAdapter. Service class should have a... [11:01:10] (03CR) 10Tacsipacsi: [C: 03+1] "LGTM, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: 10Superpes15) [11:02:17] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:02:35] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:03:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:04:00] (03Abandoned) 10Samtar: InitialiseSettings.php: Undeploy Phonos from afwiktionary, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898749 (https://phabricator.wikimedia.org/T332006) (owner: 10Samtar) [11:06:59] (03PS2) 10Jbond: team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) [11:07:07] (03CR) 10CI reject: [V: 04-1] team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:07:09] (03CR) 10Jbond: team-sre/systemd: add Check 
systemd state rule (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:07:11] (03PS5) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) [11:07:13] (03PS3) 10Jbond: team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) [11:07:28] (03CR) 10Jbond: [C: 03+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:07:47] (03CR) 10Jbond: [C: 03+2] team-sre/hardware: Add alert for sel events [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [11:08:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:09:54] (03PS2) 10Jbond: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) [11:10:20] (03Merged) 10jenkins-bot: team-sre/hardware: Add alert for sel events [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [11:10:35] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10BTullis) I have deleted most of the leftover files and moved useful to my own home directory, but I don't have permission to update the description of this ticket... 
[11:11:01] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) Personally I think it's a big conceptual change to introduce a second separate automation-pipeline for th... [11:11:27] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) [11:13:23] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) On the Netbox side I'm happy with the current status, or having it as a dropdown. I think it's good to k... [11:13:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: 10Kamila Součková) [11:15:37] (03CR) 10Jbond: [C: 03+1] "LGTM cheers" [software/spicerack] - 10https://gerrit.wikimedia.org/r/902460 (owner: 10Volans) [11:17:32] (03PS1) 10Slyngshede: Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - 10https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) [11:19:07] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:19:25] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:20:43] (03CR) 10Jbond: [V: 03+1 C: 03+2] Remove l10nupdate support [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah) [11:20:58] taavi: fyi merging ^^ [11:23:37] (03CR) 10Jbond: [C: 03+1] "lgtm" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/902449 (owner: 10Volans) [11:24:40] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Jelto) [11:24:45] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:25:03] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:25:37] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.044e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [11:27:26] (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:34:13] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Jelto) [11:36:43] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the addition" [software/spicerack] - 10https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [11:38:26] (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:38:55] (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:39:17] (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:39:21] jbond: did you run the logout cookbook? it seems to affect some puppet runs ^^^ [11:43:29] (03CR) 10Filippo Giunchedi: "LGTM, modulo Ben's vote" [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney) [11:44:17] (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable [11:44:24] (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:44:32] (03CR) 10CI reject: [V: 04-1] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:45:48] fixing ^ [11:46:13] (03PS3) 10Filippo Giunchedi: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 
(https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:46:50] (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:47:55] (03Merged) 10jenkins-bot: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [11:48:15] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10Clement_Goubert) 05Open→03Resolved [11:48:20] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Clement_Goubert) [11:55:50] !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. [11:57:05] (03PS3) 10Clément Goubert: Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [12:00:15] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10RobH) Not sure why procurement was added (so it showed up in my notifications) as this user isn't in the acl*procurement review, they are in the acl*sre-team so I... 
[12:00:22] (03CR) 10Clément Goubert: [C: 03+2] Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [12:00:24] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10RobH) @jbond, this task isn't editable by most users (so i cannot remove the invalid project), please remove the procurement project. [12:01:06] (03PS1) 10Slyngshede: Service: Ensure that dry_run is parsed to dataclass. [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) [12:06:33] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) 05Open→03Resolved a:03Clement_Goubert I have finally filled the follow up task: {T333143} Marking this on... 
[12:07:39] (03CR) 10Jbond: [C: 04-1] "a few nits and i think an bug" [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway) [12:08:53] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) [12:09:04] (03CR) 10Slyngshede: [C: 03+2] Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - 10https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [12:09:11] (03PS1) 10Filippo Giunchedi: hieradata: move alerting_host to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/903238 [12:10:02] (03PS2) 10Filippo Giunchedi: hieradata: move alerting_host to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) [12:12:40] (03Merged) 10jenkins-bot: Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - 10https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [12:13:01] (03PS1) 10EoghanGaffney: Set log format to ecs on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) [12:13:18] (03CR) 10CI reject: [V: 04-1] Set log format to ecs on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [12:13:24] (03PS2) 10EoghanGaffney: Set log format to ecs on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) [12:13:26] (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:15:54] (03CR) 10JMeybohm: k8s: Remove 1.16 related 
code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:15:58] (03CR) 10Jbond: [C: 03+2] team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [12:17:09] (03Merged) 10jenkins-bot: team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond) [12:17:16] (03PS1) 10Jbond: Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 [12:17:25] (03CR) 10CI reject: [V: 04-1] Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond) [12:17:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond) [12:17:41] (03CR) 10CI reject: [V: 04-1] Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond) [12:19:10] (03PS2) 10Jbond: Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 [12:19:52] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40338/console" [puppet] - 10https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: 10Filippo Giunchedi) [12:19:54] (03CR) 10Jbond: [C: 03+2] Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond) [12:20:48] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] hieradata: move alerting_host to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: 10Filippo Giunchedi) [12:21:06] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move 
alerting_host to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: 10Filippo Giunchedi) [12:21:44] (03Merged) 10jenkins-bot: Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond) [12:22:26] (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:23:06] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:23:42] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:27:03] (03CR) 10Jbond: "lgtm but im not sure we need this in the service class, the alertmanager instance is already set correctly which from what i see is the on" [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [12:32:36] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40339/console" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [12:36:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:40:40] (03PS4) 10JMeybohm: 
k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) [12:41:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:42:34] !log flip alert* to overlay2 - T329939 [12:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:40] T329939: alert hosts short of root disk space / docker devicemapper vs overlayfs - https://phabricator.wikimedia.org/T329939 [12:46:36] (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:47:01] (SystemdUnitFailed) firing: (5) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [12:49:09] (03CR) 10Hashar: releases-jenkins: replace Icinga with Prometheus monitoring (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [12:50:17] (03CR) 10JMeybohm: [C: 03+2] k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [12:50:44] 
(03CR) 10JMeybohm: [C: 03+2] k8s: Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [12:51:31] (SystemdUnitFailed) firing: (15) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:51:32] (SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:52:56] (03CR) 10Hashar: [C: 03+1] "Awesome! Feel free to deploy at any time. If Apache2 needs to be restarted that can be done at anytime (the impact is minimal, it is simpl" [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [12:57:26] (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [12:58:50] (03PS1) 10Btullis: Upgrade the research airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) [12:59:26] (WidespreadPuppetFailure) firing: Puppet has failed on thumbor cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thumbor - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, 
TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1300) [13:00:05] Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] o/ I can deploy [13:00:20] Hi taavi :) [13:00:37] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40340/console" [puppet] - 10https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [13:02:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: 10Superpes15) [13:03:08] (03Merged) 10jenkins-bot: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: 10Superpes15) [13:03:32] !log taavi@deploy2002 Started scap: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] [13:03:38] T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083 [13:03:55] (03CR) 10Hashar: "To clarify: +1 overall, the remarks I have made in the diff comment can be implemented or ruled out later ;)" [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [13:04:14] (03PS2) 10Slyngshede: Service: Ensure that dry_run is passed to dataclass. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) [13:04:58] !log taavi@deploy2002 superpes and taavi: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:05:02] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10oleksandr_tsyba_WMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public... [13:05:02] Superpes: please test [13:05:06] Looking [13:05:19] (03CR) 10Slyngshede: Service: Ensure that dry_run is passed to dataclass. (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [13:05:32] (03PS3) 10Slyngshede: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) [13:06:07] Looks fine thanks :) taavi [13:07:26] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10WMDE-leszek) [13:07:56] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10WMDE-leszek) Cleaned up the tags a bit, apologies @oleksandr_tsyba_WMDE. 
we have used a wrong template, again [13:08:07] (03CR) 10Jbond: "thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [13:08:40] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10WMDE-leszek) On that note, I endorse this request on WMDE's end. [13:11:31] (SystemdUnitFailed) firing: (19) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:12:17] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] (duration: 08m 45s) [13:12:23] T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083 [13:12:29] done! [13:13:44] Thanks taavi (maybe you have to run NamespaceDupes.php) :) [13:13:51] ohhhh right [13:13:52] a sec [13:14:26] (WidespreadPuppetFailure) resolved: Puppet has failed on thumbor cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thumbor - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:16:31] (SystemdUnitFailed) firing: (64) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:17:29] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [13:18:01] hm I suspect the script might be broken, it's just printing the same few pagelinks rows over and over again [13:18:05] Amir1: ^ any clues why? [13:18:18] (03PS1) 10Elukey: Move kafka-jumbo1001's kafka broker to PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064) [13:19:18] (03PS1) 10Ssingh: hiera: temporarily removed dns1003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/903246 (https://phabricator.wikimedia.org/T330165) [13:19:44] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40341/console" [puppet] - 10https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [13:20:23] looking at wmf.1 changelog I don't see anything helpful [13:24:17] sorry I was having lunch, let me check [13:24:22] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) Hoping to kick-start some more discussion around this and try to close this out. I still firmly believe tha... 
[13:24:50] basically namespaceDupes seems to not update the WHERE condition when printing the list of pagelinks rows it would need to update [13:24:54] yeah, NameSpacesDupes is broken [13:25:03] I got the same issue a week or so ago (sorry, forgot to create a task), but it didn't show up when running with --fix [13:25:06] I don't know if it's pagelinks specific or a wider issue [13:27:11] oh, I got that in https://phabricator.wikimedia.org/P45894 too, about a week ago [13:27:16] yeah it's broken, file a task and I'll take a look [13:27:19] (03PS1) 10Andrew Bogott: clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165) [13:27:47] it used to cause data corruption, I'm fine with the current state to be honest [13:27:56] hey folks lemme know when the backport window close (no rush), after that I'll start some maintenance to redis misc clusters [13:28:02] *closes [13:28:33] elukey: we're debugging a maintenance script, might take a while [13:29:14] Amir1: yeah I think I'd prefer leaving some broken rows for now over blindly running with --fix [13:29:25] I don't think we can fix the issue right now [13:29:52] let it be, links tables always have some sorta drifts [13:30:14] my hope would be to do the important fixes and the links one as an argument [13:30:16] but meh [13:30:38] hm [13:31:00] although this is breaking access to those actual pages [13:32:14] so I don't want to leave that broken either [13:34:51] (03CR) 10Btullis: [V: 03+1 C: 03+2] Upgrade the research airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [13:35:23] !log fab@deploy2002 Started deploy [airflow-dags/research@d2c115d]: (no justification provided) [13:35:44] !log fab@deploy2002 Finished deploy [airflow-dags/research@d2c115d]: (no justification provided) (duration: 00m 21s) [13:36:31] (03CR) 10Andrew Bogott: [C: 
03+2] rbd2backy2: clean up a debugging line [puppet] - 10https://gerrit.wikimedia.org/r/900648 (owner: 10Andrew Bogott) [13:37:12] taavi: can you just comment out the links updates in maint script and re-run it? [13:40:37] Amir1: I think I found the issue [13:41:26] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/829293 should have removed the addQuotes() calls from namespaceDupes.php as buildComparison does it for you [13:41:51] patch incoming [13:45:07] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10ayounsi) Overall I agree it's an improvement to have the parent interfaces defined in Netbox. I lost a bit context o... [13:46:14] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/903253 [13:46:44] (03PS1) 10Btullis: Remove stray referece to ariflow db from research instance [puppet] - 10https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) [13:47:57] thanks for catching it [13:48:24] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40342/console" [puppet] - 10https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [13:48:50] (03PS1) 10Majavah: namespaceDupes: Remove extra addQuotes() calls [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) [13:49:15] (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove stray referece to ariflow db from research instance [puppet] - 10https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis) [13:49:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) (owner: 10Majavah) [13:50:37] RECOVERY - Kafka MirrorMaker 
main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 2 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [13:53:11] (03CR) 10Andrew Bogott: [C: 03+2] dumps: properly absent enterprise timers [puppet] - 10https://gerrit.wikimedia.org/r/902833 (owner: 10Majavah) [13:55:58] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10JJMC89) [13:58:22] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) >>! In T296832#8729090, @ayounsi wrote: > I lost a bit context on how it will be done on a day to bay basis,... [13:58:39] (03PS1) 10Majavah: hieradata: swap eqiad1 dns server order [puppet] - 10https://gerrit.wikimedia.org/r/903257 [13:58:41] (03PS1) 10Majavah: hieradata: remove unused keys from labsdnsconfig [puppet] - 10https://gerrit.wikimedia.org/r/903258 [14:00:21] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [14:00:58] (03CR) 10Andrew Bogott: [C: 03+2] hieradata: swap eqiad1 dns server order [puppet] - 10https://gerrit.wikimedia.org/r/903257 (owner: 10Majavah) [14:01:22] (03CR) 10EoghanGaffney: [C: 03+2] Set log format to ecs on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [14:01:25] (03PS1) 10Majavah: Point openstack.eqiad1 CNAME to cloudcontrol1007 [dns] - 10https://gerrit.wikimedia.org/r/903259 [14:02:07] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto) [14:04:57] (03CR) 10Slyngshede: [C: 03+2] Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [14:05:59] (03CR) 10Andrew Bogott: [C: 03+2] Point openstack.eqiad1 CNAME to cloudcontrol1007 [dns] - 10https://gerrit.wikimedia.org/r/903259 (owner: 10Majavah) [14:06:20] (03Merged) 10jenkins-bot: namespaceDupes: Remove extra addQuotes() calls [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) (owner: 10Majavah) [14:06:36] !log taavi@deploy2002 Started scap: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] [14:06:43] T333166: namespaceDupes is broken - https://phabricator.wikimedia.org/T333166 [14:07:17] (03Merged) 10jenkins-bot: Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto) [14:08:00] !log taavi@deploy2002 taavi: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:08:03] (03PS1) 10Hashar: gerrit: set gitiles clone url to http (Gerrit 3.6.2) [puppet] - 10https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049) [14:09:18] (03Merged) 10jenkins-bot: Service: Ensure that dry_run is passed to dataclass. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede) [14:10:57] jouncebot: next [14:10:57] In 1 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1530) [14:11:19] elukey: give me just a few more minutes please [14:12:13] sure, I was just checking next windows :) [14:14:39] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:14:39] !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [14:14:51] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:15:04] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] (duration: 08m 27s) [14:15:09] !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [14:15:09] T333166: namespaceDupes is broken - https://phabricator.wikimedia.org/T333166 [14:15:32] (03CR) 10JHathaway: [C: 03+1] "Thanks for removing this cruft, looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [14:16:03] !log taavi@mwmaint2002 ~ $ mwscript namespaceDupes.php --wiki=huwiki --fix # T333083 [14:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:08] T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083 [14:16:09] elukey: all done! [14:16:20] nice thanks! 
[14:16:31] (SystemdUnitFailed) firing: (71) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:59] Wow wonderful taavi [14:17:09] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:17:16] Thanks :) [14:17:17] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:21:08] (03PS2) 10Andrew Bogott: clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165) [14:21:10] (03PS1) 10Andrew Bogott: Mark cloudvirt1017, 1021 and 1022 as spare systems. [puppet] - 10https://gerrit.wikimedia.org/r/903263 (https://phabricator.wikimedia.org/T333169) [14:21:25] (03PS8) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) [14:21:32] (SystemdUnitFailed) firing: (73) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:26] (03CR) 10Andrew Bogott: [C: 03+2] clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165) (owner: 10Andrew Bogott) [14:24:53] (03CR) 10Andrew Bogott: [C: 03+2] Mark cloudvirt1017, 1021 and 1022 as spare systems. 
[puppet] - 10https://gerrit.wikimedia.org/r/903263 (https://phabricator.wikimedia.org/T333169) (owner: 10Andrew Bogott) [14:27:22] (03PS1) 10EoghanGaffney: Assign insetup role to new aphlict vm [puppet] - 10https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) [14:27:28] (03PS1) 10Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265 [14:27:45] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [14:28:05] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [14:28:14] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [14:28:33] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [14:28:57] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [14:29:13] (03PS1) 10Bking: rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) [14:29:14] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [14:29:23] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [14:29:39] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [14:29:45] (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:30:04] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:30:25] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [14:33:44] (03PS2) 10Slyngshede: P:url_downloader send Squid access 
logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265 [14:34:52] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40344/console" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [14:35:45] (03Merged) 10jenkins-bot: rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking) [14:38:31] (03CR) 10Hnowlan: [C: 03+1] changeprop-jobqueue: Double resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/902067 (owner: 10Alexandros Kosiaris) [14:39:21] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:39:28] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:40:16] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:40:32] (03CR) 10Slyngshede: [V: 03+1] "During sprint-week I noticed that we're not collecting Squid access logs from the urldownload servers." [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [14:40:34] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply [14:40:56] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [14:41:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (10MNadrofsky) Approved. [14:43:28] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:43:30] (03PS1) 10Andrew Bogott: Revert "Mark cloudvirt1017, 1021 and 1022 as spare systems." 
[puppet] - 10https://gerrit.wikimedia.org/r/903268 (https://phabricator.wikimedia.org/T333169) [14:43:59] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:44:29] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:44:55] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:45:06] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:45:15] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:46:27] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [14:46:35] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [14:46:37] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Mark cloudvirt1017, 1021 and 1022 as spare systems." [puppet] - 10https://gerrit.wikimedia.org/r/903268 (https://phabricator.wikimedia.org/T333169) (owner: 10Andrew Bogott) [14:47:41] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [14:47:55] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [14:48:05] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [14:48:15] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [14:52:34] !log eoghan@cumin1001 START - Cookbook sre.ganeti.makevm for new host aphlict1002.eqiad.wmnet [14:52:35] !log eoghan@cumin1001 START - Cookbook sre.dns.netbox [14:53:41] 10SRE-Access-Requests, 10Lift-Wing, 10Machine-Learning-Team: Machine Learning team - k8s resources ccess - https://phabricator.wikimedia.org/T333174 (10isarantopoulos) [14:55:03] !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict1002.eqiad.wmnet - eoghan@cumin1001" [14:55:07] 
10SRE-Access-Requests, 10Lift-Wing, 10Machine-Learning-Team: Machine Learning team - k8s resources ccess - https://phabricator.wikimedia.org/T333174 (10isarantopoulos) [14:56:07] !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict1002.eqiad.wmnet - eoghan@cumin1001" [14:56:07] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:56:07] !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache aphlict1002.eqiad.wmnet on all recursors [14:56:10] !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aphlict1002.eqiad.wmnet on all recursors [14:57:30] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Trizek-WMF) [14:57:42] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) 05In progress→03Resolved A post-action document has been created. There is nothing special to highl... 
[14:57:50] (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [14:58:18] (03PS4) 10Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782) [15:01:31] (SystemdUnitFailed) firing: (16) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:05:52] !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aphlict1002.eqiad.wmnet [15:11:31] (SystemdUnitFailed) firing: (16) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:49] (03PS10) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [15:11:52] (03PS20) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [15:14:02] (03CR) 10CI reject: [V: 04-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [15:14:14] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [15:15:17] (03PS1) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) [15:16:19] (03PS2) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 
10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) [15:16:32] (SystemdUnitFailed) firing: (26) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:32] (SystemdUnitFailed) firing: (13) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:00] !log elukey@deploy2002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 06m 10s) [15:17:45] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:57] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [15:20:32] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40345/console" [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [15:20:39] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [15:20:44] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync [15:20:48] !log hnowlan@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/changeprop: sync [15:21:32] (SystemdUnitFailed) firing: (53) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:21:32] (SystemdUnitFailed) firing: (13) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:22:18] (03PS21) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [15:22:22] (03PS3) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) [15:22:25] (03PS8) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [15:22:35] (03PS9) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [15:22:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:23:27] PROBLEM - puppet last run on rdb2008 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:24:49] (03CR) 10CI reject: [V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [15:25:12] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a 
reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [15:26:19] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) > usefulness of cross-DC replication After asking @dcausse, I unde... [15:27:16] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10Volans) Looks ok to me too, I'm no sure about all the details involved if we need to patch things like the dns genera... [15:29:13] PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s7.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:23] RECOVERY - puppet last run on rdb2008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:29:45] (03CR) 10BBlack: [C: 03+1] "Seems right to me, for this testing!" [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez) [15:30:05] jan_drewniak: May I have your attention please! Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1530) [15:32:44] (03PS22) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 [15:32:55] (03PS10) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [15:33:00] (03PS11) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 [15:34:49] (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond) [15:35:46] (03CR) 10CI reject: [V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond) [15:36:36] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Samwalton9) [15:42:03] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Ottomata) Hi, just back from vacation too. @FNavas-foundation can you update the task description with exactly what you need access too? Your comment mentions a 'spe... 
[15:44:15] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:31] (SystemdUnitFailed) firing: (11) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:46:51] (PS1) Ayounsi: Varnish: prefix 403 and 429 with a unique ID [puppet] - https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973)
[15:46:55] (PS1) Filippo Giunchedi: alertmanager: default to IRC for foundations [puppet] - https://gerrit.wikimedia.org/r/903285
[15:47:02] jbond: ^
[15:50:26] (CR) Jelto: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: EoghanGaffney)
[15:50:53] (Abandoned) Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/903228 (owner: L10n-bot)
[15:51:53] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (Ottomata) I think they need [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Levels | sql_lab role permissions in Superset ]]. Pinging @Milimetr...
[15:53:35] (03CR) 10Hnowlan: [C: 03+2] admin: add user kamila [puppet] - 10https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: 10Kamila Součková) [15:53:38] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (10hnowlan) [15:54:04] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) [15:54:52] !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@e7f9c7f]: (no justification provided) [15:55:03] !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@e7f9c7f]: (no justification provided) (duration: 00m 11s) [15:55:44] (03PS11) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [15:56:40] (03PS4) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) [15:57:13] (03CR) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [15:57:56] (03CR) 10CI reject: [V: 04-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [15:58:10] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: default to IRC for foundations [puppet] - 10https://gerrit.wikimedia.org/r/903285 (owner: 10Filippo Giunchedi) [15:58:14] (03CR) 10Dzahn: [C: 03+2] planet: Add Wikimedia category of Jan Ainali's blog [puppet] - 10https://gerrit.wikimedia.org/r/902829 (owner: 10Legoktm) [15:58:45] (03CR) 10Dzahn: [C: 03+2] planet: Add Nemo_bis's new blog [puppet] - 10https://gerrit.wikimedia.org/r/902828 (owner: 10Legoktm) [15:59:16] (03PS2) 10Dzahn: planet: Add Wikimedia category of Jan Ainali's blog 
[puppet] - 10https://gerrit.wikimedia.org/r/902829 (owner: 10Legoktm) [16:03:53] (03PS12) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 [16:04:12] (03CR) 10Dzahn: [C: 03+2] planet: Use a more human date format than "Friday, 03 2023 March" [puppet] - 10https://gerrit.wikimedia.org/r/902832 (owner: 10Krinkle) [16:04:15] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/902066 (owner: 10Alexandros Kosiaris) [16:04:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] changeprop-jobqueue: Double resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/902067 (owner: 10Alexandros Kosiaris) [16:06:10] (03CR) 10CI reject: [V: 04-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [16:06:52] (03CR) 10Jbond: "for the mypy alerts we need to wait for a spicerack release" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [16:08:14] (03CR) 10Dzahn: [C: 03+1] "looks good, I do want to rename the role to sre_collab, but that will require rebasing one way or another" [puppet] - 10https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [16:10:03] (03Merged) 10jenkins-bot: admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/902066 (owner: 10Alexandros Kosiaris) [16:10:05] (03Merged) 10jenkins-bot: changeprop-jobqueue: Double resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/902067 (owner: 10Alexandros Kosiaris) [16:14:06] (03PS1) 10Alexandros Kosiaris: admin: Grant kserve API group read access to deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) [16:15:05] (03CR) 10Alexandros Kosiaris: 
"Luca, Janis, regardless of the outcome of the discussion in the linked task, let me know if this is the preferable way of doing this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: 10Alexandros Kosiaris) [16:20:52] (03PS1) 10Alexandros Kosiaris: admin: Fix mw-web resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/903298 [16:25:33] (03PS1) 10Alexandros Kosiaris: admin: Make sure resource quotas are honored for staging too [deployment-charts] - 10https://gerrit.wikimedia.org/r/903299 [16:26:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Fix mw-web resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/903298 (owner: 10Alexandros Kosiaris) [16:29:40] (03PS1) 10Jbond: os-reports: fix yaml data for apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/903301 [16:30:22] (03CR) 10Jbond: [V: 03+2 C: 03+2] os-reports: fix yaml data for apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/903301 (owner: 10Jbond) [16:31:05] (03Merged) 10jenkins-bot: admin: Fix mw-web resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/903298 (owner: 10Alexandros Kosiaris) [16:31:29] (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Make sure resource quotas are honored for staging too [deployment-charts] - 10https://gerrit.wikimedia.org/r/903299 (owner: 10Alexandros Kosiaris) [16:32:35] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:33:28] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:34:14] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:34:20] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[16:34:31] (SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:34:35] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [16:36:05] (03Merged) 10jenkins-bot: admin: Make sure resource quotas are honored for staging too [deployment-charts] - 10https://gerrit.wikimedia.org/r/903299 (owner: 10Alexandros Kosiaris) [16:39:09] RECOVERY - Host db1150 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [16:39:42] !log akosiaris@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:39:58] !log akosiaris@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:40:03] hashar: changeprop-jobqueue resource-quotas doubled [16:40:12] !log akosiaris@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:40:59] !log akosiaris@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[16:43:45] sigh, pinged the wrong person, sorry Antoine
[16:43:50] hnowlan: changeprop-jobqueue resource-quotas doubled
[16:44:00] (PS1) Jbond: idm: remove auto restart for apache-htcacheclean [puppet] - https://gerrit.wikimedia.org/r/903302
[16:44:28] (CR) Jbond: [V: +2 C: +2] idm: remove auto restart for apache-htcacheclean [puppet] - https://gerrit.wikimedia.org/r/903302 (owner: Jbond)
[16:45:23] (PS1) Jbond: Revert "idm: remove auto restart for apache-htcacheclean" [puppet] - https://gerrit.wikimedia.org/r/903199
[16:45:29] (CR) Jbond: [V: +2 C: +2] Revert "idm: remove auto restart for apache-htcacheclean" [puppet] - https://gerrit.wikimedia.org/r/903199 (owner: Jbond)
[16:47:38] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (Jhancock.wm) I logged into the ilo and as of now there are no errors on that link. Papaul pointed me to T330218 where he suggested moving the network port from ge-6/0/6 to ge-6/0/1. Since this issue comes back intermittently, i...
[16:48:13] SRE, Infrastructure Security, Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (Dzahn) @jbond Is there a process for offboarding that describes how to do it correctly? As basically every single time I am trying to edit pwstore it is blocked by an invalid key...
[16:49:29] akosiaris: thanks!
[16:49:57] RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:59] PROBLEM - Host cloudlb2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:23] RECOVERY - Host cloudlb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
[16:58:29] SRE, Infrastructure Security, Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (Dzahn) I removed both nfraison and eoghan from the .users file, re-signed it and then re-encrypted all files that I could encrypt, then pushed to repo. This does not change their...
[16:59:40] (SystemdUnitFailed) firing: (9) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1700)
[17:00:05] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1700)
[17:05:26] (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:06:26] er what's this lvs failure, checking
[17:08:25] hmm ran agent manually, resolved.
must be transient [17:15:26] (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [17:19:43] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) [17:19:57] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) 05In progress→03Resolved > @jbond Is there a process for offboarding that describes how to do it correctly? not really the [[ https://wikitech.wikimedia.org/wiki/SRE_Of... [17:20:35] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) > Aside from duplication of code what are the blockers to having the Kubernetes groups also in Homer? Th... [17:23:59] (03CR) 10Dzahn: "Should this be merged before the upgrade or should it wait until the upgrade?" [puppet] - 10https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar) [17:25:26] (03CR) 10Dzahn: "We already have bullseye doc machines thanks to Andrea's work. We should just switch to those." [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [17:27:55] (03CR) 10Dzahn: "just means there will be a lot more rebasing because we keep adding to this. 
in that case it's easier to abandon it" [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:28:04] (03Abandoned) 10Dzahn: monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:31:52] 10SRE, 10SRE-Access-Requests: Move mbsantos and jgiannelos from parsoid-test-admins to parsoid-test-roots - https://phabricator.wikimedia.org/T333206 (10ssastry) [17:34:59] (03PS1) 10Subramanya Sastry: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) [17:37:39] (03CR) 10Slyngshede: [V: 03+1] P:url_downloader send Squid access logs to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede) [17:41:05] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10Cmjohnson) 05Open→03Resolved The DIMM has been replaced, I updated the idrac and bios while it was offline. 
[17:45:13] (03CR) 10Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [17:56:01] (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:59:16] (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:06:26] (WidespreadPuppetFailure) firing: Puppet has failed on dnsbox cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dnsbox - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:07:35] (03PS3) 10Andrea Denisse: doc: Add support for passive_hosts synchronization via rsync [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) [18:09:33] (03CR) 10CI reject: [V: 04-1] doc: Add support for passive_hosts synchronization via rsync [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [18:11:15] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) We moved the fist batch of servers today all went well. 
[18:15:56] (03PS1) 10Dzahn: alertmanager: send sre-collab alerts to -operations and -sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/903318 (https://phabricator.wikimedia.org/T329587) [18:16:26] (03PS1) 10Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) [18:19:00] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40349/console" [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [18:19:21] (03PS2) 10Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) [18:20:23] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40350/console" [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [18:20:42] (03PS3) 10Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) [18:21:47] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40351/console" [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [18:25:05] (03PS4) 10Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) [18:28:50] (03CR) 10Dzahn: "It looks ok in compiler, and I can check in devtools, but I don't want to get into follow-ups in deployment-prep and other projects." 
[puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [18:29:05] (03CR) 10Dzahn: [C: 03+2] deployment_server: ensure Docker is installed [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [18:31:27] (03CR) 10Dzahn: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [18:37:08] (03PS14) 10Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) [18:37:18] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney second batch proposal below |Host|U space|Existing port|New port| |cloudcephosd2002-de... [18:37:51] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10KFrancis) Hello, @oleksandr_tsyba_WMDE, I'll be helping with this request. Would you please send your WMDE email address to kfrancis@wikimedia.org? 
[18:41:26] (WidespreadPuppetFailure) resolved: Puppet has failed on dnsbox cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dnsbox - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [18:41:42] (03CR) 10CI reject: [V: 04-1] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi) [18:43:47] (03CR) 10Dzahn: [C: 03+2] "confirmed noop on production deployment servers, deploy1002 and deploy2002 - fails in devtools, mostly expected" [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [18:45:35] 10SRE, 10MediaWiki-extensions-OAuth, 10Datacenter-Switchover, 10Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10larissagaulia) [18:45:48] (03PS1) 10Dzahn: Revert "deployment_server: ensure Docker is installed" [puppet] - 10https://gerrit.wikimedia.org/r/903200 [18:46:00] (03CR) 10Dzahn: [C: 03+2] "unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives don't match any configuration option: st" [puppet] - 10https://gerrit.wikimedia.org/r/903200 (owner: 10Dzahn) [18:51:02] (03CR) 10Dzahn: "what fails about them? or rather, what bothered you about them?" 
[puppet] - 10https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896) (owner: 10Jcrespo) [18:52:43] (03PS1) 10Dzahn: Revert "bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs" [puppet] - 10https://gerrit.wikimedia.org/r/903201 [18:54:42] (03CR) 10Dzahn: [C: 03+2] "puppet works again on deploy-1004.devtools after reverting so should be fine in deployment-prep as well" [puppet] - 10https://gerrit.wikimedia.org/r/903200 (owner: 10Dzahn) [18:56:30] (03CR) 10Dzahn: "changes to admin groups might require access request tickets, this should be done between clinic duty and serviceops team. I don't have co" [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: 10Subramanya Sastry) [18:57:37] (03CR) 10Dzahn: "If I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 first then we will still have IRC alerting as before." [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [18:57:46] (03CR) 10Dzahn: "If I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 first then we will still have IRC alerting as before." [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [19:06:56] (03CR) 10Dzahn: zuul: fix up service enable and ensure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [19:07:48] (03PS1) 10Jdlrobson: Expand list of wikis with language button at top. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) [19:10:10] (03PS1) 10Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) [19:11:17] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/901576/40355/" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [19:15:00] (03CR) 10Dzahn: [C: 03+2] "confirmed this changed nothing on all 3 contint* servers. zuul is still running on contint2001, masked on contint1002 and unknown on conti" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [19:18:29] (03PS2) 10Jdlrobson: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) [19:21:22] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@3259099]: bump glent jar to 0.3.2 [19:21:37] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@3259099]: bump glent jar to 0.3.2 (duration: 00m 14s) [19:25:29] (03PS1) 10Superpes15: Disable VisualEditor from talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 [19:26:01] (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:29:11] (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:30:03] 10SRE, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10greg) Hi @BCornwall ! I'm just jumping in as an FR-Tech representative. I think I've got the summary here (basically, in the end, Shopify can't meet our hsts header needs which blocks overal... [19:40:47] (03PS1) 10Dzahn: admin: add Barakat Ajadi as ldap_only_admin (wmf group) [puppet] - 10https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868) [19:43:25] (03PS2) 10Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) [19:44:08] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Dzahn) @larissagaulia Thank you for adding the information. Does "until July" mean "until last day of June"? I uploaded a code change above that is now in review. Access requests... [19:45:13] (03CR) 10Ahmon Dancy: Revert "deployment_server: ensure Docker is installed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903200 (owner: 10Dzahn) [19:45:32] (03PS3) 10Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) [19:46:20] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Dzahn) 05Open→03In progress p:05Triage→03Medium [19:46:31] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Dzahn) a:03Ladsgroup [19:49:11] (03CR) 10Dzahn: "this may have caused that you can't include the docker class on new hosts anymore without a puppet error. 
: https://gerrit.wikimedia.org/r" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [19:52:02] (03CR) 10Ahmon Dancy: k8s: Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [19:56:01] (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:59:11] (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2000). [20:00:04] jdlrobson and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:31] here [20:00:47] I can deploy [20:00:54] Hi :) [20:01:11] Jdlrobson: is it safe to deploy your two together? 
[20:01:51] kindrobot: yep [20:01:55] !log start UTC late backport window [20:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:56] (03PS1) 10Ahmon Dancy: k8s: Use storage-driver instead of storage_driver [puppet] - 10https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803) [20:04:58] Jdlrobson: what's modern-manpage? [20:05:04] (03CR) 10Ahmon Dancy: k8s: Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm) [20:05:53] Oh, mainpage. I feel silly [20:06:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) (owner: 10Jdlrobson) [20:06:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: 10Jdlrobson) [20:07:31] (03Merged) 10jenkins-bot: Expand list of wikis with language button at top. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) (owner: 10Jdlrobson) [20:08:50] (03PS3) 10Stef Dunlap: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: 10Jdlrobson) [20:09:03] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: 10Jdlrobson) [20:09:47] (03Merged) 10jenkins-bot: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: 10Jdlrobson) [20:10:02] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] [20:10:09] T322093: Enable viewing of reading lists on web - https://phabricator.wikimedia.org/T322093 [20:10:10] T331777: Add a header at the top of the Main page of French Wikiquote - https://phabricator.wikimedia.org/T331777 [20:11:29] !log kindrobot@deploy2002 jdlrobson and kindrobot: Backport for [[gerrit:903322|Expand list of wikis with language button at top. 
(T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:12:08] Jdlrobson: ready to check [20:13:18] looking :) [20:14:01] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1017.eqiad.wmnet [20:14:11] (SystemdUnitFailed) firing: (13) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:14:33] kindrobot: LGTM! [20:15:12] Great, syncing. [20:15:53] Superpes is it safe to deploy your two patches together? [20:16:01] (SystemdUnitFailed) firing: (14) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:44] Yep no issue :) kindrobot [20:17:12] (03PS1) 10Andrew Bogott: Remove puppet references to cloudvirt1017, 1021, and 1022 [puppet] - 10https://gerrit.wikimedia.org/r/903330 (https://phabricator.wikimedia.org/T333169) [20:18:27] (03CR) 10Andrew Bogott: [C: 03+2] Remove puppet references to cloudvirt1017, 1021, and 1022 [puppet] - 10https://gerrit.wikimedia.org/r/903330 (https://phabricator.wikimedia.org/T333169) (owner: 10Andrew Bogott) [20:19:11] (SystemdUnitFailed) firing: (15) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[20:19:23] (03CR) 10Subramanya Sastry: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: 10Subramanya Sastry) [20:20:52] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] (duration: 10m 50s) [20:21:01] (SystemdUnitFailed) firing: (28) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:21:02] T322093: Enable viewing of reading lists on web - https://phabricator.wikimedia.org/T322093 [20:21:02] T331777: Add a header at the top of the Main page of French Wikiquote - https://phabricator.wikimedia.org/T331777 [20:21:25] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:22:36] Amir1: should we be worried about these systemd units failing before proceeding with the backports? 
[20:23:20] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:24:11] (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:25:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:25:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:25:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1017.eqiad.wmnet [20:25:20] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1021.eqiad.wmnet [20:26:01] (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:27:15] kindrobot: which one is it? [20:27:37] thanks kindrobot ! looking good on production! 
[20:27:54] the alert2001 one, it should be fine for now [20:28:36] (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 | (SystemdUnitFailed) firing: (14) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 [20:28:57] it's a new alert, I suspect it's actually been failing for longer [20:29:07] there are way too many systemd unit failures, sigh https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:29:19] it's been ongoing for a while it seems [20:29:44] cwhite: maybe you know what's going on? especially on alert2001 [20:30:00] So would you advise continuing with the backports? [20:31:04] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:31:28] kindrobot: I'd ignore the alerts and continue [20:31:40] OK, thank you. :) [20:31:46] yeah, it's not related for sure [20:32:51] (03CR) 10EoghanGaffney: [C: 03+2] Assign insetup role to new aphlict vm [puppet] - 10https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney) [20:33:16] (03PS4) 10Stef Dunlap: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15) [20:33:16] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:34:31] Amir1: thanks for the heads up, I'll look into the auto restart failure [20:35:13] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:35:13] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:35:14] !log andrew@cumin1001 END (PASS) - Cookbook
sre.hosts.decommission (exit_code=0) for hosts cloudvirt1021.eqiad.wmnet [20:35:36] (03PS2) 10Stef Dunlap: Disable VisualEditor from talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 (owner: 10Superpes15) [20:35:42] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 (owner: 10Superpes15) [20:35:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15) [20:36:10] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1022.eqiad.wmnet [20:37:32] (03Merged) 10jenkins-bot: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15) [20:41:26] (03CR) 10JMeybohm: [C: 03+1] "Very true. Sorry for causing trouble!" 
[puppet] - 10https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803) (owner: 10Ahmon Dancy) [20:41:49] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [20:43:50] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:45:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [20:45:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:45:05] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1022.eqiad.wmnet [20:45:19] 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission cloudvirt1017.eqiad.wmnet and cloudvirt102[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T333169 (10Andrew) a:05Andrew→03Jclark-ctr [20:46:41] (03PS1) 10Jgreen: payments-listener.frdev.wikimedia.org cname for new FR staging server [dns] - 10https://gerrit.wikimedia.org/r/903332 (https://phabricator.wikimedia.org/T285892) [20:50:29] cwhite: it might be related but we are getting systemd unit fail on db1101 but the alert doesn't make sense [20:50:47] as it's really not failing [20:51:03] (maybe? I'll check) [20:51:24] From the auto-restart timer? 
[20:51:25] (SystemdUnitFailed) firing: (13) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:52:08] 10SRE, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) Hey, @greg. It's not blocking overall improvement, it's just not [[ https://wikitech.wikimedia.org/wiki/HTTPS#Current_policies_and_standards | complying with standards ]]. Since s... [20:52:36] (03CR) 10Jgreen: [C: 03+2] payments-listener.frdev.wikimedia.org cname for new FR staging server [dns] - 10https://gerrit.wikimedia.org/r/903332 (https://phabricator.wikimedia.org/T285892) (owner: 10Jgreen) [20:55:34] Amir1: db1101 is not an s7 host anymore? [20:55:54] probably Manuel moved it but he is not around [20:56:10] I think he said he reset the systemd timer [20:57:42] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:57:44] PROBLEM - Host restbase1033 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:56] I'm guessing there are auto-restart timers lingering that aren't being cleaned up by puppet. 
[20:57:58] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, [20:57:58] th aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [20:59:36] cwhite: it's not a S7 no [20:59:40] it's in M1 [21:00:03] I disabled both systemd units [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2100). [21:00:42] Note: the backport deploy window is still in progress [21:01:06] (SystemdUnitFailed) firing: (4) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:01:12] taavi: it seems like its stalled out. 
It's cleared CI, but it hasn't merged [21:02:25] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/903326/ [21:02:50] RECOVERY - Host restbase1033 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [21:03:46] PROBLEM - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [21:04:38] PROBLEM - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [21:07:22] dancy: ^ [21:11:02] !log moving Universal Code of Conduct/Enforcement guidelines -> Universal Code of Conduct/Enforcement guidelines/Version 1 on metawiki with `extensions/Translate/scripts/moveTranslatableBundle.php ` [21:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:17] kindrobot: somehow the +2 was applied to PS1 while PS2 was the latest [21:11:19] (probably don't need to log that but just in case) [21:11:25] Uh it doesn't want to merge it [21:11:51] Oh [21:12:17] hrm [21:12:33] so just re-+2 it and probably file a bug in scap [21:12:39] It should probably be OK to scap backport again, eh? [21:12:42] OK. [21:12:43] (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse) [21:12:48] +1 [21:13:10] Thank you all.
[21:13:17] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 (owner: 10Superpes15) [21:14:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:14:12] (03Merged) 10jenkins-bot: Disable VisualEditor from talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 (owner: 10Superpes15) [21:14:25] !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] [21:14:31] T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279 [21:14:54] !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@5f0eb44]: (no justification provided) [21:15:08] !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@5f0eb44]: (no justification provided) (duration: 00m 13s) [21:15:49] !log kindrobot@deploy2002 kindrobot and superpes: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:16:23] Ready to check Superpes [21:17:02] Checked both and everything is fine kindrobot! Thanks! 
:) [21:17:22] Thanks, syncing [21:18:28] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [21:18:47] (03CR) 10Dduvall: buildkitd: Isolate build container user/process/network namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: 10Dduvall) [21:19:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:22:31] (03CR) 10Dduvall: buildkitd: Isolate build container user/process/network namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: 10Dduvall) [21:22:51] !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] (duration: 08m 26s) [21:22:57] T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279 [21:23:40] Sync finished. Thanks everyone. 
[21:23:52] !log finish UTC late backports [21:23:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:11] (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:16] (SystemdUnitFailed) firing: (30) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:24:26] (WidespreadPuppetFailure) firing: Puppet has failed on wcqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wcqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:24:30] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [21:24:56] !log start of watchlist clean up in arwiki (T328501) [21:24:59] Reedy, sbassett, Maryum, and manfredi backport window finished :) [21:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:01] T328501: Request to clean my watchlist from articles in namespace 0 and 1 - https://phabricator.wikimedia.org/T328501 [21:25:15] Thanks for your time kindrobot :D [21:25:53] No problem, thank you. 
:) [21:26:16] (SystemdUnitFailed) firing: (37) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:27:38] PROBLEM - Restbase root url on restbase1033 is CRITICAL: connect to address 10.64.48.71 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [21:29:11] (SystemdUnitFailed) firing: (37) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:29:24] PROBLEM - SSH on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:30:56] PROBLEM - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [21:35:17] 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10andrea.denisse) [21:37:59] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Ladsgroup) superset should be automatically done via wmf ldap group. If Jgiannelos is in the ldap group, it should be done already. Correct? [21:39:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Ladsgroup) a:03Ladsgroup I'm on clinic duty this week. Waiting for signoff by Tyler. Maybe a deployment training can be arranged (or other devs in wmde can do an i... 
[21:40:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Ladsgroup) https://wikitech.wikimedia.org/wiki/Deployments/Training [21:42:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10taavi) > To be able to access deplyed Wiki instances and ensure that wikibase (namely wikibase client) is properly installed and configured Unless you're also planni... [21:43:53] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Ottomata) Superset has its own 'roles', and I think something changed in a recent version that makes is so the default role doesn't have access to the SQL lab feat... [21:45:34] !log T330165 Depooled relevant search platform hosts: `sudo -E cumin 'elastic[1055-1056,1074-1079,1085-1086]*,cloudelastic100[2,6]*,wcqs1002*,wdqs[1007,1012]*' 'sudo depool'` [21:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:40] T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 [21:49:11] (SystemdUnitFailed) firing: (13) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:01] (SystemdUnitFailed) firing: (40) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:58:13] !log power cycling restbase1033 — T333243 [21:58:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:18] T333243: restbase1033 is down - https://phabricator.wikimedia.org/T333243 [21:58:41] !log Deploy security fix for T326952 [21:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:11] (SystemdUnitFailed) firing: (63) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:01:08] PROBLEM - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:01:16] PROBLEM - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:01:16] PROBLEM - cassandra-a SSL 10.64.48.151:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:01:34] RECOVERY - SSH on restbase1033 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:01:38] RECOVERY - Restbase root url on restbase1033 is OK: HTTP OK: HTTP/1.1 200 - 17255 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/RESTBase [22:02:08] PROBLEM - cassandra-b service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:02:24] PROBLEM - cassandra-c service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:02:24] PROBLEM - cassandra-a service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:04:04] RECOVERY - cassandra-b service on restbase1033 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:04:18] RECOVERY - cassandra-c service on restbase1033 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:04:18] RECOVERY - cassandra-a service on restbase1033 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:04:46] (03PS2) 10EoghanGaffney: Adds php and apache logs for doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) [22:04:56] (WidespreadPuppetFailure) resolved: Puppet has failed on wcqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wcqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:05:18] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:06:16] (03CR) 10EoghanGaffney: Adds php and apache logs for doc machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [22:06:54] RECOVERY - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-b valid until 2024-08-28 11:43:21 +0000 (expires in 519 days) 
https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:07:00] RECOVERY - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-c valid until 2024-08-28 11:43:23 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:07:01] RECOVERY - cassandra-a SSL 10.64.48.151:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-a valid until 2024-08-28 11:43:18 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:07:02] RECOVERY - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.151 port 9042 https://phabricator.wikimedia.org/T93886 [22:07:04] RECOVERY - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.152 port 9042 https://phabricator.wikimedia.org/T93886 [22:07:22] RECOVERY - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.153 port 9042 https://phabricator.wikimedia.org/T93886 [22:09:50] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Volans) >>! In T330165#8731601, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/YxgIJY... [22:10:17] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [22:14:43] (03CR) 10Dzahn: "It's not true that this removes IRC notifications, they were just sent to a test channel only. 
I am fixing that here: https://gerrit.wikim" [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [22:16:08] !log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Meta:WMF Support and Safety" "Meta:WMF Trust and Safety" "Zabe" --reason "per [[:phab:T330514|T330514]]" # T330514 [22:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:15] T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514 [22:17:18] (03CR) 10Dzahn: [C: 03+2] alertmanager: send sre-collab alerts to -operations and -sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/903318 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [22:18:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 287.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [22:19:47] (03PS1) 10Zabe: Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) [22:21:42] MessageIndexException from line 191 of /srv/mediawiki/php-1.41.0-wmf.1/extensions/Translate/utils/MessageIndex.php: MessageIndex: unable to acquire lock [22:21:46] :| [22:22:26] (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:22:27] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Please update bootp helper on 
pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (10Dwisehaupt) Thanks! Verified working and runs good. [22:22:54] zabe: I don't think we need to backport that to the current wmf branch, we can let the train do it when the time comes? [22:23:34] oh, well, you're moving the meta pages now - I wanted to wait a bit [22:24:05] well, we get this done now, good :) [22:24:15] is there anything specific you wanted to wait for? [22:24:55] runs puppet on bast1003 because that alert claims puppet fails on bastion "cluster" but also I dont get the graph :) [22:27:02] and nothing actually failed there.. so no idea [22:28:08] ah, it's bast5003 pushing things over the limit and the usual background ones https://puppetboard.wikimedia.org/nodes?status=failed [22:29:15] zabe: my idea was train -> watch for failures -> rename; but since you are backporting it now, I guess there's no need to wait :) [22:30:54] well :) [22:31:10] jouncebot: nowandnext [22:31:10] For the next 0 hour(s) and 28 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2100) [22:31:10] In 3 hour(s) and 28 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0200) [22:31:29] (03CR) 10Zabe: [C: 03+2] Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) (owner: 10Zabe) [22:42:06] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic1074-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:43:32] !log apt2001 - kill 3105; run puppet [22:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:45] !log stat1004 - kill 29291; run puppet [22:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:51] (03Merged) 10jenkins-bot: Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) (owner: 10Zabe) [22:47:06] (CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic1074-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:47:20] !log zabe@deploy2002 Started scap: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] [22:47:26] T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514 [22:48:34] (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [22:48:48] !log stat1005 - kill 18179; run puppet ; stat1007 - kill 3346; run puppet ; stat1006 - kill 23887 run puppet [22:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:52:06] (CirrusSearchNodeIndexingNotIncreasing) firing: (6) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - 
https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [22:57:06] (CirrusSearchNodeIndexingNotIncreasing) firing: (8) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:00:01] !log zabe@deploy2002 zabe: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [23:00:11] T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514 [23:02:06] (CirrusSearchNodeIndexingNotIncreasing) firing: (9) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:02:11] 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10Dzahn) We got the "widespread puppet failures" alert which made me look at some random failed hosts in the list. I found the reason was this offboarding, because: apt2001: ` Err... 
[23:02:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:03:34] (SystemdUnitFailed) firing: (15) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:07:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:08:34] (SystemdUnitFailed) firing: (22) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:08:48] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] (duration: 21m 27s) [23:08:54] T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514 [23:09:26] (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:10:11] (03CR) 10Dzahn: [C: 03+2] peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 
10Dzahn) [23:13:34] (SystemdUnitFailed) firing: (60) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:15:26] (03CR) 10Dzahn: [C: 03+2] "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 it's also not true anymore that this removes IRC notifications. they sho" [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn) [23:17:29] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10colewhite) [23:18:15] (03CR) 10Dzahn: [C: 03+2] etherpad: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [23:21:49] 10SRE, 10ops-eqiad, 10Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (10wiki_willy) a:03Jclark-ctr [23:22:06] (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:22:35] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (10wiki_willy) a:03Jclark-ctr [23:24:23] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10wiki_willy) a:03Papaul [23:24:26] (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no 
resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [23:24:39] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Htriedman) @MoritzMuehlenhoff Sorry if this is a silly question, but I've been trying to run commands as `analytics-platform-eng` on stat machines by using `sudo -u analytics-platform-eng ...` and am b... [23:25:24] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10wiki_willy) a:03Jhancock.wm [23:29:25] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10wiki_willy) Hi guys - can we confirm the firmware is all up to date? Thanks, Willy [23:31:12] !log deployed patch for T330968 [23:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:33:34] (SystemdUnitFailed) firing: (58) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:34] (SystemdUnitFailed) firing: (63) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:42:22] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Dzahn) Hi @Htriedman and @MoritzMuehlenhoff, the answer to this riddle is that while the special user "`analytics-platform-eng`" exists on all stat* machines, the admin group 
`analytics-platform-eng-admin... [23:44:30] (03CR) 10Dzahn: [C: 03+2] "working per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*people.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_inpu" [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn) [23:44:42] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Ottomata) > @Htriedman I think this comes down to a new access request like "add analytics-platform-eng-admins on stat* hosts". Or ssh to an-airflow1004 and run your sudo cmd there :) [23:47:08] !log people1003 - taking down apache to provoke monitoring alert (inactive instances) and confirm IRC alerting change works [23:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:06] jinxer-wm: jinx it [23:50:55] (ProbeDown) firing: Service people1003:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:51:02] PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:51:23] oh, well, that worked but the Icinga part isnt gone [23:51:31] it was supposed to replace that [23:52:42] RECOVERY - people.wikimedia.org requires authentication on people1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [23:55:50] (ProbeDown) resolved: Service people1003:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1003:443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:59:07] (03CR) 10Dzahn: [C: 03+2] "confirmed this reports on IRC on both channels and also created a ticket, as desired" [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)