[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:09:28] <wikibugs>	 ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (Andrew) @Jclark-ctr it'll be another week or two before we have workloads moved off of this.
[02:26:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:10:09] <wikibugs>	 (PS2) KartikMistry: Update cxserver to 2023-03-17-133444-production [deployment-charts] - https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379)
[05:13:01] <wikibugs>	 (PS3) Marostegui: mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)
[05:14:21] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510
[05:14:27] <stashbot>	 T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
[05:14:37] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510
[05:16:56] <kart_>	 Updating cxserver, minor changes.
[05:18:10] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-03-17-133444-production [deployment-charts] - https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) (owner: KartikMistry)
[05:19:04] <wikibugs>	 (PS1) Marostegui: db1179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/902902 (https://phabricator.wikimedia.org/T332292)
[05:19:35] <wikibugs>	 (PS15) KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[05:19:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1120 T332292', diff saved to https://phabricator.wikimedia.org/P45942 and previous config saved to /var/cache/conftool/dbconfig/20230327-051941-root.json
[05:19:46] <stashbot>	 T332292: Move db1179 to x1 - https://phabricator.wikimedia.org/T332292
[05:19:53] <wikibugs>	 (CR) Marostegui: [C: +2] db1179: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/902902 (https://phabricator.wikimedia.org/T332292) (owner: Marostegui)
[05:22:56] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-03-17-133444-production [deployment-charts] - https://gerrit.wikimedia.org/r/900960 (https://phabricator.wikimedia.org/T332379) (owner: KartikMistry)
[05:23:42] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1179 to x1 [puppet] - https://gerrit.wikimedia.org/r/902903 (https://phabricator.wikimedia.org/T332292)
[05:23:47] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:24:14] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1179 to x1 [puppet] - https://gerrit.wikimedia.org/r/902903 (https://phabricator.wikimedia.org/T332292) (owner: Marostegui)
[05:24:27] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[05:28:00] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[05:28:52] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[05:37:57] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[05:38:42] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[05:40:49] <kart_>	 !log Updated cxserver to 2023-03-17-133444-production (T332379 + build changes)
[05:40:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:54] <stashbot>	 T332379: Post-creation work for anpwiki - https://phabricator.wikimedia.org/T332379
[05:57:34] <wikibugs>	 (PS1) KartikMistry: Enable Section Translation on some wikis while Content Translation remains in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834)
[06:19:47] <wikibugs>	 (CR) Krinkle: Fix PHP string interpolation (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868528 (https://phabricator.wikimedia.org/T314096) (owner: Reedy)
[06:29:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:36:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45944 and previous config saved to /var/cache/conftool/dbconfig/20230327-063642-root.json
[06:40:20] <marostegui>	 !log Rename flaggedrevs tables on db1123 ptwikisource T332594
[06:40:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:25] <stashbot>	 T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
[06:51:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45945 and previous config saved to /var/cache/conftool/dbconfig/20230327-065147-root.json
[06:51:53] <marostegui>	 !log dbmaint s3 eqiad Rename flaggedrevs tables on db1123 ptwikisource T332594
[06:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:58] <stashbot>	 T332594: Drop FlaggedRevs tables in database for ptwikisource - https://phabricator.wikimedia.org/T332594
[06:54:22] <wikibugs>	 (PS1) Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T0700).
[07:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:06:49] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:06:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45946 and previous config saved to /var/cache/conftool/dbconfig/20230327-070651-root.json
[07:07:57] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:15] <wikibugs>	 (PS1) Marostegui: backups: Replace db1164 with db1101 [puppet] - https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510)
[07:12:08] <marostegui>	 jynus: also that one ^ :)
[07:12:19] <jynus>	 oh
[07:12:29] <jynus>	 I forgot
[07:12:46] <jynus>	 needs 2 changes actually
[07:12:59] <marostegui>	 ah yes
[07:13:00] <marostegui>	 I see it
[07:13:02] <marostegui>	 let me fix it
[07:13:27] <wikibugs>	 (PS2) Marostegui: backups: Replace db1164 with db1101 [puppet] - https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510)
[07:13:29] <marostegui>	 jynus: ^
[07:13:41] <wikibugs>	 (PS4) Marostegui: mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510)
[07:13:47] <wikibugs>	 (CR) Jcrespo: [C: +1] backups: Replace db1164 with db1101 [puppet] - https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:14:06] <jynus>	 one sec because I was looking and there are backups still running
[07:14:27] <marostegui>	 sure no problem
[07:21:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45947 and previous config saved to /var/cache/conftool/dbconfig/20230327-072156-root.json
[07:30:28] <wikibugs>	 (CR) Jcrespo: [C: +1] mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:32:29] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1101 to m1 master [puppet] - https://gerrit.wikimedia.org/r/902572 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[07:33:11] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Joe)
[07:34:11] * urbanecm goes to do some MW deployment, since B&C is empty
[07:34:16] <wikibugs>	 (CR) Urbanecm: [C: +2] SpecialWikiSets: Avoid calling WikiSet::getId on null [extensions/CentralAuth] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902741 (https://phabricator.wikimedia.org/T333075) (owner: Urbanecm)
[07:34:32] <wikibugs>	 (CR) Urbanecm: [C: +2] GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) (owner: Urbanecm)
[07:36:40] <wikibugs>	 (Merged) jenkins-bot: SpecialWikiSets: Avoid calling WikiSet::getId on null [extensions/CentralAuth] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902741 (https://phabricator.wikimedia.org/T333075) (owner: Urbanecm)
[07:37:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45948 and previous config saved to /var/cache/conftool/dbconfig/20230327-073701-root.json
[07:38:50] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]]
[07:38:58] <stashbot>	 T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
[07:39:50] <jynus>	 !log disabling puppet and shutting down bacula at backup1001 T331510
[07:39:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:55] <stashbot>	 T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
[07:41:52] <jynus>	 a prometheus availability job will alert because of the above log, as the job only monitors that 1 host
[07:44:25] <wikibugs>	 (PS1) Jcrespo: bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs [puppet] - https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896)
[07:46:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:48:21] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[07:48:26] <stashbot>	 T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
[07:48:39] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, netops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (ayounsi) Open→Resolved a:ayounsi Done!
[07:51:39] <wikibugs>	 (CR) Jcrespo: [C: +2] bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs [puppet] - https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896) (owner: Jcrespo)
[07:52:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1120 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45949 and previous config saved to /var/cache/conftool/dbconfig/20230327-075206-root.json
[07:52:55] <wikibugs>	 (Merged) jenkins-bot: GrowthMentors.json: Add a write-only username field [extensions/GrowthExperiments] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/902734 (https://phabricator.wikimedia.org/T331444) (owner: Urbanecm)
[07:55:13] <icinga-wm>	 RECOVERY - PHP7 rendering on parse2017 is OK: HTTP OK: HTTP/1.1 302 Found - 519 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[07:55:36] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:902741|SpecialWikiSets: Avoid calling WikiSet::getId on null (T333075)]] (duration: 16m 45s)
[07:55:41] <stashbot>	 T333075: Error: Call to a member function getId() on null - https://phabricator.wikimedia.org/T333075
[07:58:36] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]]
[07:58:41] <stashbot>	 T331444: MediaWiki:GrowthMentors.json: Add a write-only username field - https://phabricator.wikimedia.org/T331444
[07:59:58] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:00:57] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:12] <wikibugs>	 (CR) Tacsipacsi: [huwiki] Add Draft and Draft_talk namespaces (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: Superpes15)
[08:02:04] <wikibugs>	 (PS1) Ladsgroup: EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941)
[08:02:27] <wikibugs>	 (CR) Ladsgroup: [C: +2] EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941) (owner: Ladsgroup)
[08:02:53] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40331/console" [puppet] - https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[08:03:43] <marostegui>	 !log Failover m1 from db1164 to db1101 - T331510
[08:03:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:49] <stashbot>	 T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
[08:03:54] <urbanecm>	 Amir1: fyi my scap backport's just about to finish
[08:04:14] <Amir1>	 mine takes twenty minutes to merge, don't worry
[08:04:14] <marostegui>	 all done jynus 
[08:04:21] <urbanecm>	 ok
[08:04:28] <jynus>	 ok to merge the backup patches?
[08:04:47] <wikibugs>	 (CR) Marostegui: [C: +2] backups: Replace db1164 with db1101 [puppet] - https://gerrit.wikimedia.org/r/903175 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[08:05:02] <marostegui>	 Etherpad looks fie
[08:05:03] <marostegui>	 fine
[08:05:35] <jynus>	 it is a bit slow for me
[08:05:53] <marostegui>	 I guess it's warming up
[08:06:04] <marostegui>	 I can open the test pad fine
[08:06:29] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:902734|GrowthMentors.json: Add a write-only username field (T331444)]] (duration: 07m 52s)
[08:06:34] <stashbot>	 T331444: MediaWiki:GrowthMentors.json: Add a write-only username field - https://phabricator.wikimedia.org/T331444
[08:06:51] <jynus>	 it is ok for me now
[08:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:07:11] <jynus>	 what else to test?
[08:07:30] <marostegui>	 jynus: librenms, which also works fine for me
[08:07:31] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:07:48] <jynus>	 orch is complaining about lag, I guess not real?
[08:07:53] <marostegui>	 reload :)
[08:08:18] <jynus>	 still happening
[08:08:29] * urbanecm done
[08:08:31] <marostegui>	 ah I know why
[08:09:14] <jynus>	 cleanup of the table maybe?
[08:09:32] <wikibugs>	 (PS1) Marostegui: db1101: Make it master [puppet] - https://gerrit.wikimedia.org/r/903181 (https://phabricator.wikimedia.org/T331510)
[08:09:33] <marostegui>	 jynus: nope, this ^
[08:09:37] <jynus>	 I see
[08:09:44] <wikibugs>	 (CR) Marostegui: [V: +2 C: +2] db1101: Make it master [puppet] - https://gerrit.wikimedia.org/r/903181 (https://phabricator.wikimedia.org/T331510) (owner: Marostegui)
[08:10:40] <marostegui>	 jynus: fixed!
[08:11:25] <jynus>	 looking at the original path to see why I didn't see that
[08:11:29] <jynus>	 *patch
[08:11:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:12:50] <jynus>	 let me run puppet on backup hosts
[08:12:52] <wikibugs>	 (CR) Jelto: [V: +1 C: +1] "lgtm, left one little question in-line" [puppet] - https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: Hashar)
[08:12:53] <jynus>	 to apply the change
[08:16:40] <wikibugs>	 SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[08:17:27] <wikibugs>	 (CR) Ladsgroup: mediawiki: Reduce the frequency of flaggedrevs updates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: Ladsgroup)
[08:17:29] <wikibugs>	 (PS1) Marostegui: Revert "backups: Replace db1164 with db1101" [puppet] - https://gerrit.wikimedia.org/r/903188
[08:17:32] * urbanecm rolls out one more change
[08:17:35] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) (owner: Urbanecm)
[08:17:39] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/903188 (owner: Marostegui)
[08:17:59] <wikibugs>	 (Merged) jenkins-bot: EntityUsageTable: Mark query as read-only [extensions/Wikibase] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/903186 (https://phabricator.wikimedia.org/T332941) (owner: Ladsgroup)
[08:18:18] <wikibugs>	 (Merged) jenkins-bot: [Growth] eswiki: Enable mentorship for 50% of newcomers [mediawiki-config] - https://gerrit.wikimedia.org/r/902455 (https://phabricator.wikimedia.org/T332737) (owner: Urbanecm)
[08:18:25] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus1006: depool from alertmanager [puppet] - https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165)
[08:18:33] <logmsgbot>	 !log urbanecm@deploy2002 Backport cancelled.
[08:19:45] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123)
[08:19:58] <wikibugs>	 (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: Marostegui)
[08:20:54] <wikibugs>	 (CR) Jcrespo: [C: +1] Revert "backups: Replace db1164 with db1101" [puppet] - https://gerrit.wikimedia.org/r/903188 (owner: Marostegui)
[08:20:59] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]]
[08:21:01] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[08:21:05] <stashbot>	 T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
[08:23:23] <wikibugs>	 (PS1) Filippo Giunchedi: wmnet: move reads to graphite2004 [dns] - https://gerrit.wikimedia.org/r/903185 (https://phabricator.wikimedia.org/T330165)
[08:24:59] <wikibugs>	 (PS1) Filippo Giunchedi: graphite: check graphite2004 [puppet] - https://gerrit.wikimedia.org/r/903206 (https://phabricator.wikimedia.org/T330165)
[08:25:23] <wikibugs>	 (CR) Marostegui: [C: -2] mariadb: Promote db1164 to m1 master [puppet] - https://gerrit.wikimedia.org/r/903182 (https://phabricator.wikimedia.org/T333123) (owner: Marostegui)
[08:25:47] <logmsgbot>	 !log urbanecm@deploy2002 Synchronized wmf-config/InitialiseSettings.php: 63dd23b5ceaba35c8d9682493dd21d99a20fc8f7: [Growth] eswiki: Enable mentorship for 50% of newcomers (T332737, T285235) (duration: 06m 09s)
[08:25:54] <stashbot>	 T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235
[08:25:54] <stashbot>	 T332737: Increase percentage of newcomers who receive Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T332737
[08:26:40] <wikibugs>	 (PS1) Filippo Giunchedi: statsd: move writes to graphite2004 [puppet] - https://gerrit.wikimedia.org/r/903207 (https://phabricator.wikimedia.org/T330165)
[08:26:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:28:10] <wikibugs>	 (PS1) Filippo Giunchedi: wmnet: move writes to graphite2004 [dns] - https://gerrit.wikimedia.org/r/903208 (https://phabricator.wikimedia.org/T330165)
[08:28:14] <jynus>	 !log restarting bacula at backup1001 T331510
[08:28:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:20] <stashbot>	 T331510: Switchover m1 master (db1164 -> db1101) - https://phabricator.wikimedia.org/T331510
[08:30:09] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (Ladsgroup)
[08:30:29] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:30:34] <stashbot>	 T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
[08:31:25] <wikibugs>	 (PS1) Filippo Giunchedi: Failover statsd to graphite2004 [mediawiki-config] - https://gerrit.wikimedia.org/r/903209 (https://phabricator.wikimedia.org/T330165)
[08:31:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:32:29] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40332/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[08:32:31] <wikibugs>	 (CR) JMeybohm: [C: +1] "SGTM" [puppet] - https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: Elukey)
[08:34:48] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:35:43] <wikibugs>	 (CR) Elukey: [C: +1] k8s: Force to be explicit about k8s and calico versions [puppet] - https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[08:36:04] <wikibugs>	 (CR) Elukey: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[08:36:11] <wikibugs>	 (PS2) Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)
[08:36:44] <wikibugs>	 (CR) Jelto: [V: +1 C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[08:36:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:38:42] <wikibugs>	 (CR) CI reject: [V: -1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[08:39:15] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:903186|EntityUsageTable: Mark query as read-only (T332941)]] (duration: 18m 15s)
[08:39:22] <stashbot>	 T332941: Warning: SQLPlatform::isWriteQuery fallback to regex (from Wikibase EntityUsageTable) - https://phabricator.wikimedia.org/T332941
[08:40:24] <wikibugs>	 (CR) Elukey: [C: +1] k8s: Remove 1.16 related code (1 comment) [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[08:40:53] <wikibugs>	 (CR) Elukey: [C: +2] role::kafka::jumbo::broker: enable PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) (owner: Elukey)
[08:43:54] <wikibugs>	 SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (fgiunchedi)
[08:45:18] <wikibugs>	 (PS2) Hashar: wm-zuul-status: filter out non-live item [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/902705 (https://phabricator.wikimedia.org/T214068)
[08:46:46] <wikibugs>	 (CR) Clément Goubert: [C: +1] prometheus1006: depool from alertmanager [puppet] - https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165) (owner: Filippo Giunchedi)
[08:47:02] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[08:50:14] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus1006: depool from alertmanager [puppet] - https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T330165) (owner: Filippo Giunchedi)
[08:51:17] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[08:52:14] <godog>	 hah that was me, false alarm
[08:52:37] <godog>	 prometheus1005 was also depooled, I've repooled it now
[08:53:39] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40333/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: Dduvall)
[08:55:02] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[08:56:17] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[08:57:02] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[08:57:19] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for mw-api-int - cgoubert@cumin1001"
[08:58:24] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for mw-api-int - cgoubert@cumin1001"
[08:58:24] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:00:45] <wikibugs>	 (PS1) Clément Goubert: mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120)
[09:02:18] <wikibugs>	 (CR) Jelto: [V: +1] "looks mostly good, one question in-line" [puppet] - https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: Dduvall)
[09:02:52] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:03:06] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:03:10] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:03:33] <wikibugs>	 (CR) Clément Goubert: [C: +2] mw-api-int: Add records [dns] - https://gerrit.wikimedia.org/r/903214 (https://phabricator.wikimedia.org/T333120) (owner: Clément Goubert)
[09:04:08] <wikibugs>	 SRE, Commons, Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (Aklapper)
[09:06:20] <wikibugs>	 (PS1) Clément Goubert: Revert "mw-api-int: Add records" [dns] - https://gerrit.wikimedia.org/r/903190
[09:08:25] <wikibugs>	 (CR) Clément Goubert: [C: +2] Revert "mw-api-int: Add records" [dns] - https://gerrit.wikimedia.org/r/903190 (owner: Clément Goubert)
[09:12:17] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[09:12:59] <wikibugs>	 (03PS1) 10Clément Goubert: mw-api-int: Add records [dns] - 10https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120)
[09:13:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) I pondered multiple options for the Netbox `server_bgp` custom field, feedback from ServiceOps welcome ba...
[09:15:23] <wikibugs>	 (03PS2) 10Clément Goubert: mw-api-int: Add records [dns] - 10https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120)
[09:17:17] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[09:17:18] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-api-int: Add records [dns] - 10https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:17:20] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] mediawiki: Reduce the frequency of flaggedrevs updates (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859589 (https://phabricator.wikimedia.org/T323495) (owner: 10Ladsgroup)
[09:18:04] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: Add records [dns] - 10https://gerrit.wikimedia.org/r/903215 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:24:56] <wikibugs>	 (03PS3) 10Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)
[09:25:43] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:27:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[09:33:54] <wikibugs>	 (03PS4) 10Ayounsi: Add cookbook to configure router's BGP sessions to k8s hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/903174 (https://phabricator.wikimedia.org/T306649)
[09:39:55] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[09:40:10] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM, optional nits inline." [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:40:59] <wikibugs>	 (03PS1) 10Jbond: Offboard nfraison [puppet] - 10https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135)
[09:41:05] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:43:37] <wikibugs>	 (03PS2) 10Jbond: Offboard nfraison [puppet] - 10https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135)
[09:44:49] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 45295
[09:45:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Offboard nfraison [puppet] - 10https://gerrit.wikimedia.org/r/903219 (https://phabricator.wikimedia.org/T333135) (owner: 10Jbond)
[09:45:41] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 45295
[09:46:49] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] service_catalog: Add mw-api-int k8s service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:47:07] <wikibugs>	 (03PS2) 10Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)
[09:47:09] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] P:kubernetes::node: Use performance governor [puppet] - 10https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788) (owner: 10Clément Goubert)
[09:47:13] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main1003.eqiad.wmnet with reason: stop kafka and dist-upgrade
[09:47:26] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main1003.eqiad.wmnet with reason: stop kafka and dist-upgrade
[09:50:02] <wikibugs>	 (03PS3) 10Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)
[09:50:53] <wikibugs>	 (03CR) 10Clément Goubert: service_catalog: Add mw-api-int k8s service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:51:57] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40336/console" [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:54:32] <wikibugs>	 (03PS7) 10Filippo Giunchedi: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond)
[09:54:34] <wikibugs>	 (03PS3) 10Filippo Giunchedi: team-sre/puppet-agent: Add widespread puppet failure (no resources) alert [alerts] - 10https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond)
[09:57:46] <wikibugs>	 (03PS2) 10JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)
[09:58:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[09:59:29] <wikibugs>	 (03CR) 10LSobanski: [C: 04-1] "The change has not been confirmed yet so let's not jump the gun on this." [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[09:59:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::errorpage: rationalize usage [puppet] - 10https://gerrit.wikimedia.org/r/902446 (owner: 10Giuseppe Lavagetto)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1000)
[10:02:10] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10hnowlan)
[10:02:30] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ArielGlenn)
[10:02:49] <wikibugs>	 (03CR) 10Jelto: monitoring/alerting: globally replace serviceops-collab with sre-collab (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[10:03:13] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10hnowlan)
[10:03:49] <wikibugs>	 (03PS3) 10JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)
[10:03:51] <Emperor>	 !log depool ms-fe2009
[10:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "thanks" [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond)
[10:05:30] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/puppet-agent: Add widespread puppet failure alert [alerts] - 10https://gerrit.wikimedia.org/r/902323 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond)
[10:05:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] team-sre/puppet-agent: Add widespread puppet failure (no resources) alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902348 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond)
[10:06:12] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[10:06:24] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 4 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium a:03Clement_Goubert
[10:06:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[10:06:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] team-sre/resource: Add disk space [alerts] - 10https://gerrit.wikimedia.org/r/902457 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[10:07:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Ben, does this look good to you? thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth)
[10:08:32] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Untested but LGTM, thank you Daniel" [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[10:08:44] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] releases: remove Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[10:09:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[10:09:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] etherpad: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[10:10:09] <elukey>	 !log dist-upgrade kafka-main1003 manually to bullseye - T332013
[10:10:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:15] <stashbot>	 T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013
[10:13:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM modulo alert name" [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[10:15:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[10:15:34] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] etherpad: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[10:17:04] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] releases: remove Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[10:17:52] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[10:20:11] <wikibugs>	 (03PS1) 10Jbond: cinga: drop nfraison from ACL's [puppet] - 10https://gerrit.wikimedia.org/r/903221 (https://phabricator.wikimedia.org/T333135)
[10:20:17] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] mediawiki::tlsproxy::yaml_defs: add error page to envoy [puppet] - 10https://gerrit.wikimedia.org/r/902447 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[10:21:16] <wikibugs>	 (03CR) 10JMeybohm: k8s: Force docker storage-driver to overlay2 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[10:21:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cinga: drop nfraison from ACL's [puppet] - 10https://gerrit.wikimedia.org/r/903221 (https://phabricator.wikimedia.org/T333135) (owner: 10Jbond)
[10:21:30] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:22:16] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "PCC (expected to fail on alert) https://puppet-compiler.wmflabs.org/output/902318/40337/" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[10:22:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:22:30] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[10:24:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on kafka_main cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:24:44] <wikibugs>	 10SRE, 10Infrastructure Security, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond)
[10:24:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:24:57] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[10:25:31] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond)
[10:27:08] <_joe_>	 jouncebot: next
[10:27:09] <jouncebot>	 In 2 hour(s) and 32 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1300)
[10:27:15] <_joe_>	 jouncebot: now
[10:27:15] <jouncebot>	 For the next 0 hour(s) and 32 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1000)
[10:27:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[10:27:28] <_joe_>	 elukey: this sounds promising ^^
[10:27:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:28:00] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[10:28:02] <elukey>	 yep all recovered :)
[10:28:39] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[10:28:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[10:29:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on kafka_main cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:29:40] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:30:55] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:31:20] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:32:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:33:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh
[10:34:39] <jinxer-wm>	 (NodeTextfileStale) resolved: Stale textfile for puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:34:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:35:50] <wikibugs>	 (03PS4) 10EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245)
[10:36:17] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[10:39:22] <elukey>	 this is due to the roll restart --^
[10:39:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:41:17] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[10:41:17] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (10SLyngshede-WMF) We're missing a "dry_run" for services and puppet, but Puppet doesn't need it, as the decorator also checks for _remote_hosts.
[10:41:28] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[10:42:45] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond)
[10:43:18] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) @Dzahn can you take care of password store
[10:44:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:45:17] <wikibugs>	 (03PS4) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033)
[10:47:02] <wikibugs>	 (03Merged) 10jenkins-bot: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) (owner: 10Giuseppe Lavagetto)
[10:48:05] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10BTullis) I will take care of the HBase/Hadoop permissions and any leftover files.
[10:48:19] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney)
[10:52:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:54:14] <wikibugs>	 (03CR) 10Superpes15: [huwiki] Add Draft and Draft_talk namespaces (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: 10Superpes15)
[10:55:29] <wikibugs>	 (03PS6) 10Superpes15: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083)
[10:55:37] <wikibugs>	 (03PS7) 10Superpes15: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083)
[10:56:49] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:57:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:59:26] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: IcingaHosts.wait_for_downtimed() does not honor dry_run - https://phabricator.wikimedia.org/T315537 (10SLyngshede-WMF) PuppetMaster Class needs dry_run, this can be done by letting the class inherit from RemoteHostsAdapter.  Service class should have a...
[11:01:10] <wikibugs>	 (03CR) 10Tacsipacsi: [C: 03+1] "LGTM, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: 10Superpes15)
[11:02:17] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:02:35] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:03:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:04:00] <wikibugs>	 (03Abandoned) 10Samtar: InitialiseSettings.php: Undeploy Phonos from afwiktionary, arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/898749 (https://phabricator.wikimedia.org/T332006) (owner: 10Samtar)
[11:06:59] <wikibugs>	 (03PS2) 10Jbond: team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764)
[11:07:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:07:09] <wikibugs>	 (03CR) 10Jbond: team-sre/systemd: add Check systemd state rule (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:07:11] <wikibugs>	 (03PS5) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033)
[11:07:13] <wikibugs>	 (03PS3) 10Jbond: team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764)
[11:07:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:07:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] team-sre/hardware: Add alert for sel events [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[11:08:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:09:54] <wikibugs>	 (03PS2) 10Jbond: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764)
[11:10:20] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/hardware: Add alert for sel events [alerts] - 10https://gerrit.wikimedia.org/r/902103 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond)
[11:10:35] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10BTullis) I have deleted most of the leftover files and moved useful to my own home directory, but I don't have permission to update the description of this ticket...
[11:11:01] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) Personally I think it's a big conceptual change to introduce a second separate automation-pipeline for th...
[11:11:27] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond)
[11:13:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) On the Netbox side I'm happy with the current status, or having it as a dropdown.  I think it's good to k...
[11:13:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: 10Kamila Součková)
[11:15:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM cheers" [software/spicerack] - 10https://gerrit.wikimedia.org/r/902460 (owner: 10Volans)
[11:17:32] <wikibugs>	 (03PS1) 10Slyngshede: Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - 10https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537)
[11:19:07] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:19:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:20:43] <wikibugs>	 (03CR) 10Jbond: [V: 03+1 C: 03+2] Remove l10nupdate support [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah)
[11:20:58] <jbond>	 taavi: fyi merging ^^
[11:23:37] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/902449 (owner: 10Volans)
[11:24:40] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Jelto)
[11:24:45] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:25:03] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:25:37] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.044e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[11:27:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:34:13] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Jelto)
[11:36:43] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the addition" [software/spicerack] - 10https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[11:38:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:38:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:39:17] <jinxer-wm>	 (KafkaBrokerUnavailable) firing: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[11:39:21] <volans>	 jbond: did you run the logout cookbook? it seems to affect some puppet runs ^^^
[11:43:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, modulo Ben's vote" [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney)
[11:44:17] <jinxer-wm>	 (KafkaBrokerUnavailable) resolved: Kafka broker unavailable for cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaBrokerUnavailable
[11:44:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:44:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:45:48] <godog>	 fixing ^
[11:46:13] <wikibugs>	 (03PS3) 10Filippo Giunchedi: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:46:50] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:47:55] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/hardware: add alertmanager tests to replace check_ipmi_sensor [alerts] - 10https://gerrit.wikimedia.org/r/902754 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[11:48:15] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10Clement_Goubert) 05Open→03Resolved
[11:48:20] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q3:rack/setup/install mw2420-mw2451 - https://phabricator.wikimedia.org/T326362 (10Clement_Goubert)
[11:55:50] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[11:57:05] <wikibugs>	 (03PS3) 10Clément Goubert: Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy)
[12:00:15] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10RobH) Not sure why procurement was added (so it showed up in my notifications) as this user isn't in the acl*procurement review, they are in the acl*sre-team so I...
[12:00:22] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy)
[12:00:24] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations, 10procurement: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10RobH) @jbond, this task isn't editable by most users (so i cannot remove the invalid project), please remove the procurement project.
[12:01:06] <wikibugs>	 (03PS1) 10Slyngshede: Service: Ensure that dry_run is parsed to dataclass. [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)
[12:06:33] <wikibugs>	 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10hashar) 05Open→03Resolved a:03Clement_Goubert I have finally filed the follow up task: {T333143}  Marking this on...
[12:07:39] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "a few nits and i think an bug" [puppet] - 10https://gerrit.wikimedia.org/r/902808 (https://phabricator.wikimedia.org/T331706) (owner: 10JHathaway)
[12:08:53] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond)
[12:09:04] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - 10https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[12:09:11] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: move alerting_host to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/903238
[12:10:02] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: move alerting_host to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939)
[12:12:40] <wikibugs>	 (03Merged) 10jenkins-bot: Puppet: PuppetMaster class should inherit RemoteHostsAdapter [software/spicerack] - 10https://gerrit.wikimedia.org/r/903235 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[12:13:01] <wikibugs>	 (03PS1) 10EoghanGaffney: Set log format to ecs on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245)
[12:13:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Set log format to ecs on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney)
[12:13:24] <wikibugs>	 (03PS2) 10EoghanGaffney: Set log format to ecs on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245)
[12:13:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:15:54] <wikibugs>	 (03CR) 10JMeybohm: k8s: Remove 1.16 related code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[12:15:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[12:17:09] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/systemd: add Check systemd state rule [alerts] - 10https://gerrit.wikimedia.org/r/902701 (https://phabricator.wikimedia.org/T332764) (owner: 10Jbond)
[12:17:16] <wikibugs>	 (03PS1) 10Jbond: Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192
[12:17:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond)
[12:17:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond)
[12:17:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond)
[12:19:10] <wikibugs>	 (03PS2) 10Jbond: Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192
[12:19:52] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40338/console" [puppet] - 10https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: 10Filippo Giunchedi)
[12:19:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond)
[12:20:48] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+1] hieradata: move alerting_host to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: 10Filippo Giunchedi)
[12:21:06] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: move alerting_host to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/903238 (https://phabricator.wikimedia.org/T329939) (owner: 10Filippo Giunchedi)
[12:21:44] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "team-sre/hardware: Add alert for sel events" [alerts] - 10https://gerrit.wikimedia.org/r/903192 (owner: 10Jbond)
[12:22:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:23:06] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[12:23:42] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[12:27:03] <wikibugs>	 (03CR) 10Jbond: "lgtm but im not sure we need this in the service class, the alertmanager instance is already set correctly which from what i see is the on" [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[12:32:36] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40339/console" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[12:36:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:40:40] <wikibugs>	 (03PS4) 10JMeybohm: k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803)
[12:41:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:42:34] <godog>	 !log flip alert* to overlay2 - T329939
[12:42:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:40] <stashbot>	 T329939: alert hosts short of root disk space / docker devicemapper vs overlayfs - https://phabricator.wikimedia.org/T329939
[12:46:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:47:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:49:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[12:49:09] <wikibugs>	 (03CR) 10Hashar: releases-jenkins: replace Icinga with Prometheus monitoring (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[12:50:17] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] k8s: Force docker storage-driver to overlay2 [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[12:50:44] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] k8s: Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[12:51:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:51:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:52:56] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Awesome! Feel free to deploy at any time. If Apache2 needs to be restarted that can be done at anytime (the impact is minimal, it is simpl" [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney)
[12:57:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:58:50] <wikibugs>	 (03PS1) 10Btullis: Upgrade the research airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193)
[12:59:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on thumbor cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thumbor - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1300)
[13:00:05] <jouncebot>	 Superpes: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:14] <taavi>	 o/ I can deploy
[13:00:20] <Superpes>	 Hi taavi :)
[13:00:37] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40340/console" [puppet] - 10https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis)
[13:02:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: 10Superpes15)
[13:03:08] <wikibugs>	 (03Merged) 10jenkins-bot: [huwiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902888 (https://phabricator.wikimedia.org/T333083) (owner: 10Superpes15)
[13:03:32] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]]
[13:03:38] <stashbot>	 T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
[13:03:55] <wikibugs>	 (03CR) 10Hashar: "To clarify: +1 overall,  the remarks I have made in the diff comment can be implemented or ruled out later ;)" [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[13:04:14] <wikibugs>	 (03PS2) 10Slyngshede: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)
[13:04:58] <logmsgbot>	 !log taavi@deploy2002 superpes and taavi: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:05:02] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10oleksandr_tsyba_WMDE) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public...
[13:05:02] <taavi>	 Superpes: please test
[13:05:06] <Superpes>	 Looking 
[13:05:19] <wikibugs>	 (03CR) 10Slyngshede: Service: Ensure that dry_run is passed to dataclass. (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[13:05:32] <wikibugs>	 (03PS3) 10Slyngshede: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537)
[13:06:07] <Superpes>	 Looks fine thanks :) taavi
[13:07:26] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10WMDE-leszek)
[13:07:56] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10WMDE-leszek) Cleaned up the tags a bit, apologies @oleksandr_tsyba_WMDE. we have used a wrong template, again
[13:08:07] <wikibugs>	 (03CR) 10Jbond: "thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[13:08:40] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10WMDE-leszek) On that note, I endorse this request on WMDE's end.
[13:11:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (19) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:17] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:902888|[huwiki] Add Draft and Draft_talk namespaces (T333083)]] (duration: 08m 45s)
[13:12:23] <stashbot>	 T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
[13:12:29] <taavi>	 done!
[13:13:44] <Superpes>	 Thanks taavi (maybe you have to run NamespaceDupes.php) :)
[13:13:51] <taavi>	 ohhhh right
[13:13:52] <taavi>	 a sec
[13:14:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on thumbor cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thumbor - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[13:16:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (64) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:29] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[13:18:01] <taavi>	 hm I suspect the script might be broken, it's just printing the same few pagelinks rows over and over again
[13:18:05] <taavi>	 Amir1: ^ any clues why?
[13:18:18] <wikibugs>	 (03PS1) 10Elukey: Move kafka-jumbo1001's kafka broker to PKI certs [puppet] - 10https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064)
[13:19:18] <wikibugs>	 (03PS1) 10Ssingh: hiera: temporarily removed dns1003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/903246 (https://phabricator.wikimedia.org/T330165)
[13:19:44] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40341/console" [puppet] - 10https://gerrit.wikimedia.org/r/903245 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey)
[13:20:23] <taavi>	 looking at wmf.1 changelog I don't see anything helpful
[13:24:17] <Amir1>	 sorry I was having lunch, let me check
[13:24:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) Hoping to kick-start some more discussion around this and try to close this out.  I still firmly believe tha...
[13:24:50] <taavi>	 basically namespaceDupes seems to not update the WHERE condition when printing the list of pagelinks rows it would need to update
[13:24:54] <Amir1>	 yeah, namespaceDupes is broken
[13:25:03] <zabe>	 I got the same issue a week or so ago (sorry, forgot to create a task), but it didn't show up when running with --fix
[13:25:06] <taavi>	 I don't know if it's pagelinks specific or a wider issue
[13:27:11] <TheresNoTime>	 oh, I got that in https://phabricator.wikimedia.org/P45894 too, about a week ago
[13:27:16] <Amir1>	 yeah it's broken, file a task and I'll take a look
[13:27:19] <wikibugs>	 (03PS1) 10Andrew Bogott: clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165)
[13:27:47] <Amir1>	 it used to cause data corruption, I'm fine with the current state to be honest
[13:27:56] <elukey>	 hey folks lemme know when the backport window close (no rush), after that I'll start some maintenance to redis misc clusters
[13:28:02] <elukey>	 *closes
[13:28:33] <taavi>	 elukey: we're debugging a maintenance script, might take a while
[13:29:14] <taavi>	 Amir1: yeah I think I'd prefer leaving some broken rows for now over blindly running with --fix
[13:29:25] <Amir1>	 I don't think we can fix the issue right now
[13:29:52] <Amir1>	 let it be, links tables always have some sorta drifts
[13:30:14] <Amir1>	 my hope would be to do the important fixes and the links one as an argument 
[13:30:16] <Amir1>	 but meh
[13:30:38] <taavi>	 hm
[13:31:00] <taavi>	 although this is breaking access to those actual pages
[13:32:14] <taavi>	 so I don't want to leave that broken either
[13:34:51] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Upgrade the research airflow instance [puppet] - 10https://gerrit.wikimedia.org/r/903242 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis)
[13:35:23] <logmsgbot>	 !log fab@deploy2002 Started deploy [airflow-dags/research@d2c115d]: (no justification provided)
[13:35:44] <logmsgbot>	 !log fab@deploy2002 Finished deploy [airflow-dags/research@d2c115d]: (no justification provided) (duration: 00m 21s)
[13:36:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] rbd2backy2: clean up a debugging line [puppet] - 10https://gerrit.wikimedia.org/r/900648 (owner: 10Andrew Bogott)
[13:37:12] <Amir1>	 taavi: can you just comment out the links updates in maint script and re-run it?
[13:40:37] <taavi>	 Amir1: I think I found the issue
[13:41:26] <taavi>	 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/829293 should have removed the addQuotes() calls from namespaceDupes.php as buildComparison does it for you
[13:41:51] <taavi>	 patch incoming
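A minimal sketch of the double-quoting problem described above, written in Python rather than MediaWiki's PHP. The helpers `add_quotes` and `build_comparison` are hypothetical stand-ins, not the real MediaWiki database API, and the column name `pl_title` and the example title are assumptions chosen only for illustration. The point is that a comparison builder which quotes its own values must be given raw values; a caller that quotes first produces a literal that includes the quote characters.

```python
def add_quotes(value: str) -> str:
    """Hypothetical stand-in for a database layer's string-literal quoting."""
    # Escape embedded single quotes by doubling them, then wrap in quotes.
    return "'" + value.replace("'", "''") + "'"


def build_comparison(op: str, conds: dict[str, str]) -> str:
    """Hypothetical comparison builder that quotes values itself.

    Only the single-column case is handled, to keep the sketch short.
    """
    (column, value), = conds.items()
    return f"{column} {op} {add_quotes(value)}"


title = "Draft:Example"

# Correct: pass the raw value and let the builder do the quoting.
correct = build_comparison(">", {"pl_title": title})

# Buggy: the caller quotes first, so the value is quoted twice.
double_quoted = build_comparison(">", {"pl_title": add_quotes(title)})

print(correct)
print(double_quoted)
```

In the buggy case the SQL literal denotes the string with its surrounding quote characters included, so the condition compares against the wrong value. If such a condition is what advances a batched query, that would be consistent with the same rows being listed repeatedly, though the sketch does not reproduce the actual script.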
[13:45:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10ayounsi) Overall I agree it's an improvement to have the parent interfaces defined in Netbox.  I lost a bit context o...
[13:46:14] <taavi>	 https://gerrit.wikimedia.org/r/c/mediawiki/core/+/903253
[13:46:44] <wikibugs>	 (03PS1) 10Btullis: Remove stray referece to ariflow db from research instance [puppet] - 10https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193)
[13:47:57] <Amir1>	 thanks for catching it
[13:48:24] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40342/console" [puppet] - 10https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis)
[13:48:50] <wikibugs>	 (03PS1) 10Majavah: namespaceDupes: Remove extra addQuotes() calls [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166)
[13:49:15] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Remove stray referece to ariflow db from research instance [puppet] - 10https://gerrit.wikimedia.org/r/903254 (https://phabricator.wikimedia.org/T326193) (owner: 10Btullis)
[13:49:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) (owner: 10Majavah)
[13:50:37] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 2 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[13:53:11] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] dumps: properly absent enterprise timers [puppet] - 10https://gerrit.wikimedia.org/r/902833 (owner: 10Majavah)
[13:55:58] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10JJMC89)
[13:58:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) >>! In T296832#8729090, @ayounsi wrote: > I lost a bit context on how it will be done on a day to bay basis,...
[13:58:39] <wikibugs>	 (03PS1) 10Majavah: hieradata: swap eqiad1 dns server order [puppet] - 10https://gerrit.wikimedia.org/r/903257
[13:58:41] <wikibugs>	 (03PS1) 10Majavah: hieradata: remove unused keys from labsdnsconfig [puppet] - 10https://gerrit.wikimedia.org/r/903258
[14:00:21] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney)
[14:00:58] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] hieradata: swap eqiad1 dns server order [puppet] - 10https://gerrit.wikimedia.org/r/903257 (owner: 10Majavah)
[14:01:22] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Set log format to ecs on doc hosts [puppet] - 10https://gerrit.wikimedia.org/r/903239 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney)
[14:01:25] <wikibugs>	 (03PS1) 10Majavah: Point openstack.eqiad1 CNAME to cloudcontrol1007 [dns] - 10https://gerrit.wikimedia.org/r/903259
[14:02:07] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto)
[14:04:57] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[14:05:59] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Point openstack.eqiad1 CNAME to cloudcontrol1007 [dns] - 10https://gerrit.wikimedia.org/r/903259 (owner: 10Majavah)
[14:06:20] <wikibugs>	 (03Merged) 10jenkins-bot: namespaceDupes: Remove extra addQuotes() calls [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903194 (https://phabricator.wikimedia.org/T333166) (owner: 10Majavah)
[14:06:36] <logmsgbot>	 !log taavi@deploy2002 Started scap: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]]
[14:06:43] <stashbot>	 T333166: namespaceDupes is broken - https://phabricator.wikimedia.org/T333166
[14:07:17] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert: Remove the .Values.kubernetesApi hack" [deployment-charts] - 10https://gerrit.wikimedia.org/r/902078 (owner: 10Giuseppe Lavagetto)
[14:08:00] <logmsgbot>	 !log taavi@deploy2002 taavi: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[14:08:03] <wikibugs>	 (03PS1) 10Hashar: gerrit: set gitiles clone url to http (Gerrit 3.6.2) [puppet] - 10https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049)
[14:09:18] <wikibugs>	 (03Merged) 10jenkins-bot: Service: Ensure that dry_run is passed to dataclass. [software/spicerack] - 10https://gerrit.wikimedia.org/r/903237 (https://phabricator.wikimedia.org/T315537) (owner: 10Slyngshede)
[14:10:57] <elukey>	 jouncebot: next
[14:10:57] <jouncebot>	 In 1 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1530)
[14:11:19] <taavi>	 elukey: give me just a few more minutes please
[14:12:13] <elukey>	 sure, I was just checking next windows :)
[14:14:39] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:14:39] <logmsgbot>	 !log oblivian@deploy2002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply
[14:14:51] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:15:04] <logmsgbot>	 !log taavi@deploy2002 Finished scap: Backport for [[gerrit:903194|namespaceDupes: Remove extra addQuotes() calls (T333166)]] (duration: 08m 27s)
[14:15:09] <logmsgbot>	 !log oblivian@deploy2002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply
[14:15:09] <stashbot>	 T333166: namespaceDupes is broken - https://phabricator.wikimedia.org/T333166
[14:15:32] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "Thanks for removing this cruft, looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[14:16:03] <taavi>	 !log taavi@mwmaint2002 ~ $ mwscript namespaceDupes.php --wiki=huwiki  --fix # T333083
[14:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:08] <stashbot>	 T333083: Requesting Draft namespace for hu.wikipedia - https://phabricator.wikimedia.org/T333083
[14:16:09] <taavi>	 elukey: all done!
[14:16:20] <elukey>	 nice thanks!
[14:16:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (71) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:16:59] <Superpes>	 Wow wonderful taavi
[14:17:09] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:17:16] <Superpes>	 Thanks :)
[14:17:17] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:21:08] <wikibugs>	 (03PS2) 10Andrew Bogott: clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165)
[14:21:10] <wikibugs>	 (03PS1) 10Andrew Bogott: Mark cloudvirt1017, 1021 and 1022 as spare systems. [puppet] - 10https://gerrit.wikimedia.org/r/903263 (https://phabricator.wikimedia.org/T333169)
[14:21:25] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: charts: upgrade to mesh 1.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983)
[14:21:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (73) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:26] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] clouddumps: make clouddumps1002 the primary during switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903249 (https://phabricator.wikimedia.org/T330165) (owner: 10Andrew Bogott)
[14:24:53] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Mark cloudvirt1017, 1021 and 1022 as spare systems. [puppet] - 10https://gerrit.wikimedia.org/r/903263 (https://phabricator.wikimedia.org/T333169) (owner: 10Andrew Bogott)
[14:27:22] <wikibugs>	 (03PS1) 10EoghanGaffney: Assign insetup role to new aphlict vm [puppet] - 10https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369)
[14:27:28] <wikibugs>	 (03PS1) 10Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265
[14:27:45] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync
[14:28:05] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync
[14:28:14] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: sync
[14:28:33] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync
[14:28:57] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[14:29:13] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675)
[14:29:14] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[14:29:23] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[14:29:39] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[14:29:45] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking)
[14:30:04] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking)
[14:30:25] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
[14:33:44] <wikibugs>	 (03PS2) 10Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265
[14:34:52] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40344/console" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[14:35:45] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: use correct config path for dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/903266 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking)
[14:38:31] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] changeprop-jobqueue: Double resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/902067 (owner: 10Alexandros Kosiaris)
[14:39:21] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:39:28] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:40:16] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:40:32] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "During sprint-week I noticed that we're not collecting Squid access logs from the urldownload servers." [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[14:40:34] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:40:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
[14:41:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (10MNadrofsky) Approved.
[14:43:28] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[14:43:30] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "Mark cloudvirt1017, 1021 and 1022 as spare systems." [puppet] - 10https://gerrit.wikimedia.org/r/903268 (https://phabricator.wikimedia.org/T333169)
[14:43:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:44:29] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:44:55] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:45:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:45:15] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:46:27] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[14:46:35] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:46:37] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "Mark cloudvirt1017, 1021 and 1022 as spare systems." [puppet] - 10https://gerrit.wikimedia.org/r/903268 (https://phabricator.wikimedia.org/T333169) (owner: 10Andrew Bogott)
[14:47:41] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[14:47:55] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[14:48:05] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[14:48:15] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[14:52:34] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.ganeti.makevm for new host aphlict1002.eqiad.wmnet
[14:52:35] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.dns.netbox
[14:53:41] <wikibugs>	 10SRE-Access-Requests, 10Lift-Wing, 10Machine-Learning-Team: Machine Learning team -  k8s resources ccess - https://phabricator.wikimedia.org/T333174 (10isarantopoulos)
[14:55:03] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict1002.eqiad.wmnet - eoghan@cumin1001"
[14:55:07] <wikibugs>	 10SRE-Access-Requests, 10Lift-Wing, 10Machine-Learning-Team: Machine Learning team -  k8s resources ccess - https://phabricator.wikimedia.org/T333174 (10isarantopoulos)
[14:56:07] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM aphlict1002.eqiad.wmnet - eoghan@cumin1001"
[14:56:07] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:56:07] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.dns.wipe-cache aphlict1002.eqiad.wmnet on all recursors
[14:56:10] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) aphlict1002.eqiad.wmnet on all recursors
[14:57:30] <wikibugs>	 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (10Trizek-WMF)
[14:57:42] <wikibugs>	 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) 05In progress→03Resolved A post-action document has been created. There is nothing special to highl...
[14:57:50] <wikibugs>	 (03CR) 10Jbond: "see inline" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[14:58:18] <wikibugs>	 (03PS4) 10Samtar: deployment-prep: update prometheus host to prometheus05 [puppet] - 10https://gerrit.wikimedia.org/r/868510 (https://phabricator.wikimedia.org/T324782)
[15:01:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:05:52] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host aphlict1002.eqiad.wmnet
[15:11:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:49] <wikibugs>	 (03PS10) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130
[15:11:52] <wikibugs>	 (03PS20) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093
[15:14:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[15:14:14] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond)
[15:15:17] <wikibugs>	 (03PS1) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)
[15:16:19] <wikibugs>	 (03PS2) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)
[15:16:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (26) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:16:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:00] <logmsgbot>	 !log elukey@deploy2002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 06m 10s)
[15:17:45] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:19:57] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[15:20:32] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40345/console" [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[15:20:39] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[15:20:44] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[15:20:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[15:21:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (53) sync_check_icinga_contacts.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:21:32] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:22:18] <wikibugs>	 (03PS21) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093
[15:22:22] <wikibugs>	 (03PS3) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)
[15:22:25] <wikibugs>	 (03PS8) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135
[15:22:35] <wikibugs>	 (03PS9) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135
[15:22:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:23:27] <icinga-wm>	 PROBLEM - puppet last run on rdb2008 is CRITICAL: CRITICAL: Puppet last ran 6 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:24:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond)
[15:25:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond)
[15:26:19] <wikibugs>	 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10Ottomata) > usefulness of cross-DC replication  After asking @dcausse, I unde...
[15:27:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device associations in Netbox - https://phabricator.wikimedia.org/T296832 (10Volans) Looks ok to me too, I'm not sure about all the details involved if we need to patch things like the dns genera...
[15:29:13] <icinga-wm>	 PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s7.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:29:23] <icinga-wm>	 RECOVERY - puppet last run on rdb2008 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:29:45] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] "Seems right to me, for this testing!" [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799) (owner: 10Vgutierrez)
[15:30:05] <jouncebot>	 jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1530)
[15:32:44] <wikibugs>	 (03PS22) 10Jbond: sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093
[15:32:55] <wikibugs>	 (03PS10) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135
[15:33:00] <wikibugs>	 (03PS11) 10Jbond: sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135
[15:34:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.puppetboard.restart-reboot: create a reboot book for puppetboard [cookbooks] - 10https://gerrit.wikimedia.org/r/849093 (owner: 10Jbond)
[15:35:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.netbox.restart-reboot: create a reboot book for netbox [cookbooks] - 10https://gerrit.wikimedia.org/r/849135 (owner: 10Jbond)
[15:36:36] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Samwalton9)
[15:42:03] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10Ottomata) Hi, just back from vacation too.   @FNavas-foundation can you update the task description with exactly what you need access too?  Your comment mentions a 'spe...
[15:44:15] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:46:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (11) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:46:51] <wikibugs>	 (03PS1) 10Ayounsi: Varnish: prefix 403 and 429 with a unique ID [puppet] - 10https://gerrit.wikimedia.org/r/903284 (https://phabricator.wikimedia.org/T330973)
[15:46:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: default to IRC for foundations [puppet] - 10https://gerrit.wikimedia.org/r/903285
[15:47:02] <godog>	 jbond: ^
[15:50:26] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney)
[15:50:53] <wikibugs>	 (03Abandoned) 10Abijeet Patro: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/903228 (owner: 10L10n-bot)
[15:51:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Ottomata) I think they need [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Levels | sql_lab role permissions in Superset ]]. Pinging @Milimetr...
[15:53:35] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] admin: add user kamila [puppet] - 10https://gerrit.wikimedia.org/r/902444 (https://phabricator.wikimedia.org/T332921) (owner: 10Kamila Součková)
[15:53:38] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (10hnowlan)
[15:54:04] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (10Clement_Goubert)
[15:54:52] <logmsgbot>	 !log ebysans@deploy2002 Started deploy [airflow-dags/analytics@e7f9c7f]: (no justification provided)
[15:55:03] <logmsgbot>	 !log ebysans@deploy2002 Finished deploy [airflow-dags/analytics@e7f9c7f]: (no justification provided) (duration: 00m 11s)
[15:55:44] <wikibugs>	 (03PS11) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130
[15:56:40] <wikibugs>	 (03PS4) 10Vgutierrez: varnish: Bypass ATS for esitest requests [puppet] - 10https://gerrit.wikimedia.org/r/903274 (https://phabricator.wikimedia.org/T308799)
[15:57:13] <wikibugs>	 (03CR) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services (036 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[15:57:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[15:58:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: default to IRC for foundations [puppet] - 10https://gerrit.wikimedia.org/r/903285 (owner: 10Filippo Giunchedi)
[15:58:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] planet: Add Wikimedia category of Jan Ainali's blog [puppet] - 10https://gerrit.wikimedia.org/r/902829 (owner: 10Legoktm)
[15:58:45] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] planet: Add Nemo_bis's new blog [puppet] - 10https://gerrit.wikimedia.org/r/902828 (owner: 10Legoktm)
[15:59:16] <wikibugs>	 (03PS2) 10Dzahn: planet: Add Wikimedia category of Jan Ainali's blog [puppet] - 10https://gerrit.wikimedia.org/r/902829 (owner: 10Legoktm)
[16:03:53] <wikibugs>	 (03PS12) 10Jbond: sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130
[16:04:12] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] planet: Use a more human date format than "Friday, 03 2023 March" [puppet] - 10https://gerrit.wikimedia.org/r/902832 (owner: 10Krinkle)
[16:04:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/902066 (owner: 10Alexandros Kosiaris)
[16:04:22] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] changeprop-jobqueue: Double resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/902067 (owner: 10Alexandros Kosiaris)
[16:06:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.SREDnsDiscBatchRunnerBase: add new base class for dnsdisc services [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[16:06:52] <wikibugs>	 (03CR) 10Jbond: "for the mypy alerts we need to wait for a spicerack release" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[16:08:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "looks good, I do want to rename the role to sre_collab, but that will require rebasing one way or another" [puppet] - 10https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney)
[16:10:03] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: Merge eqiad,codfw namespaces quotes in main [deployment-charts] - 10https://gerrit.wikimedia.org/r/902066 (owner: 10Alexandros Kosiaris)
[16:10:05] <wikibugs>	 (03Merged) 10jenkins-bot: changeprop-jobqueue: Double resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/902067 (owner: 10Alexandros Kosiaris)
[16:14:06] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: admin: Grant kserve API group read access to deploy user [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174)
[16:15:05] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Luca, Janis, regardless of the outcome of the discussion in the linked task, let me know if this is the preferable way of doing this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: 10Alexandros Kosiaris)
[16:20:52] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: admin: Fix mw-web resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/903298
[16:25:33] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: admin: Make sure resource quotas are honored for staging too [deployment-charts] - 10https://gerrit.wikimedia.org/r/903299
[16:26:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Fix mw-web resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/903298 (owner: 10Alexandros Kosiaris)
[16:29:40] <wikibugs>	 (03PS1) 10Jbond: os-reports: fix yaml data for apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/903301
[16:30:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] os-reports: fix yaml data for apt_repo [puppet] - 10https://gerrit.wikimedia.org/r/903301 (owner: 10Jbond)
[16:31:05] <wikibugs>	 (03Merged) 10jenkins-bot: admin: Fix mw-web resource quotas [deployment-charts] - 10https://gerrit.wikimedia.org/r/903298 (owner: 10Alexandros Kosiaris)
[16:31:29] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] admin: Make sure resource quotas are honored for staging too [deployment-charts] - 10https://gerrit.wikimedia.org/r/903299 (owner: 10Alexandros Kosiaris)
[16:32:35] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:33:28] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[16:34:14] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[16:34:20] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[16:34:31] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) generate_os_reports.service Failed on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:35] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[16:36:05] <wikibugs>	 (03Merged) 10jenkins-bot: admin: Make sure resource quotas are honored for staging too [deployment-charts] - 10https://gerrit.wikimedia.org/r/903299 (owner: 10Alexandros Kosiaris)
[16:39:09] <icinga-wm>	 RECOVERY - Host db1150 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[16:39:42] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:39:58] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:40:03] <akosiaris>	 hashar: changeprop-jobqueue resource-quotas doubled
[16:40:12] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:40:59] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:43:45] <akosiaris>	 sigh, pinged the wrong person, sorry Antoine
[16:43:50] <akosiaris>	 hnowlan: changeprop-jobqueue resource-quotas doubled
[16:44:00] <wikibugs>	 (03PS1) 10Jbond: idm: remove auto restart for apache-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/903302
[16:44:28] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] idm: remove auto restart for apache-htcacheclean [puppet] - 10https://gerrit.wikimedia.org/r/903302 (owner: 10Jbond)
[16:45:23] <wikibugs>	 (03PS1) 10Jbond: Revert "idm: remove auto restart for apache-htcacheclean" [puppet] - 10https://gerrit.wikimedia.org/r/903199
[16:45:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "idm: remove auto restart for apache-htcacheclean" [puppet] - 10https://gerrit.wikimedia.org/r/903199 (owner: 10Jbond)
[16:47:38] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (10Jhancock.wm) I logged into the ilo and as of now there are no errors on that link.  Papaul pointed me to T330218 where he suggested moving the network port from ge-6/0/6 to ge-6/0/1. Since this issue comes back intermittently, i...
[16:48:13] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10Dzahn) @jbond Is there a process for offboarding that describes how to do it correctly?  As basically every single time I am trying to edit pwstore it is blocked by an invalid key...
[16:49:29] <hnowlan>	 akosiaris: thanks! 
[16:49:57] <icinga-wm>	 RECOVERY - Check systemd state on idm1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:53:59] <icinga-wm>	 PROBLEM - Host cloudlb2001-dev is DOWN: PING CRITICAL - Packet loss = 100%
[16:54:23] <icinga-wm>	 RECOVERY - Host cloudlb2001-dev is UP: PING OK - Packet loss = 0%, RTA = 31.69 ms
[16:58:29] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10Dzahn) I removed both nfraison and eoghan from the .users file, re-signed it and then re-encrypted all files that I could encrypt, then pushed to repo.  This does not change their...
[16:59:40] <jinxer-wm>	 (SystemdUnitFailed) firing: (9) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1700)
[17:00:05] <jouncebot>	 ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T1700)
[17:05:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:06:26] <sukhe>	 er what's this lvs failure, checking
[17:08:25] <sukhe>	 hmm ran agent manually, resolved. must be transient
[17:15:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:19:43] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond)
[17:19:57] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) 05In progress→03Resolved > @jbond Is there a process for offboarding that describes how to do it correctly?  not really the [[ https://wikitech.wikimedia.org/wiki/SRE_Of...
[17:20:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) > Aside from duplication of code what are the blockers to having the Kubernetes groups also in Homer?  Th...
[17:23:59] <wikibugs>	 (03CR) 10Dzahn: "Should this be merged before the upgrade or should it wait until the upgrade?" [puppet] - 10https://gerrit.wikimedia.org/r/903260 (https://phabricator.wikimedia.org/T206049) (owner: 10Hashar)
[17:25:26] <wikibugs>	 (03CR) 10Dzahn: "We already have bullseye doc machines thanks to Andrea's work. We should just switch to those." [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar)
[17:27:55] <wikibugs>	 (03CR) 10Dzahn: "just means there will be a lot more rebasing because we keep adding to this. in that case it's easier to abandon it" [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[17:28:04] <wikibugs>	 (03Abandoned) 10Dzahn: monitoring/alerting: globally replace serviceops-collab with sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/902791 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[17:31:52] <wikibugs>	 10SRE, 10SRE-Access-Requests: Move mbsantos and jgiannelos from parsoid-test-admins to parsoid-test-roots - https://phabricator.wikimedia.org/T333206 (10ssastry)
[17:34:59] <wikibugs>	 (03PS1) 10Subramanya Sastry: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206)
[17:37:39] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] P:url_downloader send Squid access logs to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[17:41:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10Cmjohnson) 05Open→03Resolved The DIMM has been replaced, I updated the idrac and bios while it was offline.
[17:45:13] <wikibugs>	 (03CR) 10Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[17:56:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:59:16] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:06:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on dnsbox cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dnsbox - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:07:35] <wikibugs>	 (03PS3) 10Andrea Denisse: doc: Add support for passive_hosts synchronization via rsync [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477)
[18:09:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] doc: Add support for passive_hosts synchronization via rsync [puppet] - 10https://gerrit.wikimedia.org/r/902825 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse)
[18:11:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) We moved the first batch of servers today; all went well.
[18:15:56] <wikibugs>	 (03PS1) 10Dzahn: alertmanager: send sre-collab alerts to -operations and -sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/903318 (https://phabricator.wikimedia.org/T329587)
[18:16:26] <wikibugs>	 (03PS1) 10Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)
[18:19:00] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40349/console" [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse)
[18:19:21] <wikibugs>	 (03PS2) 10Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)
[18:20:23] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40350/console" [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse)
[18:20:42] <wikibugs>	 (03PS3) 10Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)
[18:21:47] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40351/console" [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse)
[18:25:05] <wikibugs>	 (03PS4) 10Andrea Denisse: doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477)
[18:28:50] <wikibugs>	 (03CR) 10Dzahn: "It looks ok in compiler, and I can check in devtools, but I don't want to get into follow-ups in deployment-prep and other projects." [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[18:29:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] deployment_server: ensure Docker is installed [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[18:31:27] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse)
[18:37:08] <wikibugs>	 (03PS14) 10Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[18:37:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney second batch proposal below |Host|U space|Existing port|New port| |cloudcephosd2002-de...
[18:37:51] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10KFrancis) Hello, @oleksandr_tsyba_WMDE, I'll be helping with this request.  Would you please send your WMDE email address to kfrancis@wikimedia.org?
[18:41:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on dnsbox cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dnsbox - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[18:41:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Refactor and centralize BGPpeer config [deployment-charts] - 10https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: 10Ayounsi)
[18:43:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed noop on production deployment servers, deploy1002 and deploy2002 - fails in devtools, mostly expected" [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[18:45:35] <wikibugs>	 10SRE, 10MediaWiki-extensions-OAuth, 10Datacenter-Switchover, 10Performance-Team (Radar): Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10larissagaulia)
[18:45:48] <wikibugs>	 (03PS1) 10Dzahn: Revert "deployment_server: ensure Docker is installed" [puppet] - 10https://gerrit.wikimedia.org/r/903200
[18:46:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives don't match any configuration option: st" [puppet] - 10https://gerrit.wikimedia.org/r/903200 (owner: 10Dzahn)
[18:51:02] <wikibugs>	 (03CR) 10Dzahn: "what fails about them? or rather, what bothered you about them?" [puppet] - 10https://gerrit.wikimedia.org/r/903179 (https://phabricator.wikimedia.org/T331896) (owner: 10Jcrespo)
[18:52:43] <wikibugs>	 (03PS1) 10Dzahn: Revert "bacula: Add miscweb2003 jobs to the list of monitoring-ignored jobs" [puppet] - 10https://gerrit.wikimedia.org/r/903201
[18:54:42] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "puppet works again on deploy-1004.devtools after reverting so should be fine in deployment-prep as well" [puppet] - 10https://gerrit.wikimedia.org/r/903200 (owner: 10Dzahn)
[18:56:30] <wikibugs>	 (03CR) 10Dzahn: "changes to admin groups might require access request tickets, this should be done between clinic duty and serviceops team. I don't have co" [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: 10Subramanya Sastry)
[18:57:37] <wikibugs>	 (03CR) 10Dzahn: "If I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 first then we will still have IRC alerting as before." [puppet] - 10https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[18:57:46] <wikibugs>	 (03CR) 10Dzahn: "If I do https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 first then we will still have IRC alerting as before." [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[19:06:56] <wikibugs>	 (03CR) 10Dzahn: zuul: fix up service enable and ensure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[19:07:48] <wikibugs>	 (03PS1) 10Jdlrobson: Expand list of wikis with language button at top. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777)
[19:10:10] <wikibugs>	 (03PS1) 10Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)
[19:11:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/901576/40355/" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[19:15:00] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed this changed nothing on all 3 contint* servers. zuul is still running on contint2001, masked on contint1002 and unknown on conti" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[19:18:29] <wikibugs>	 (03PS2) 10Jdlrobson: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093)
[19:21:22] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@3259099]: bump glent jar to 0.3.2
[19:21:37] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@3259099]: bump glent jar to 0.3.2 (duration: 00m 14s)
[19:25:29] <wikibugs>	 (03PS1) 10Superpes15: Disable VisualEditor from talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326
[19:26:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:29:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:30:03] <wikibugs>	 10SRE, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10greg) Hi @BCornwall ! I'm just jumping in as an FR-Tech representative. I think I've got the summary here (basically, in the end, Shopify can't meet our hsts header needs which blocks overal...
[19:40:47] <wikibugs>	 (03PS1) 10Dzahn: admin: add Barakat Ajadi as ldap_only_admin (wmf group) [puppet] - 10https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868)
[19:43:25] <wikibugs>	 (03PS2) 10Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)
[19:44:08] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Dzahn) @larissagaulia Thank you for adding the information. Does "until July" mean "until last day of June"? I uploaded a code change above that is now in review.  Access requests...
[19:45:13] <wikibugs>	 (03CR) 10Ahmon Dancy: Revert "deployment_server: ensure Docker is installed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903200 (owner: 10Dzahn)
[19:45:32] <wikibugs>	 (03PS3) 10Superpes15: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279)
[19:46:20] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Dzahn) 05Open→03In progress p:05Triage→03Medium
[19:46:31] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Dzahn) a:03Ladsgroup
[19:49:11] <wikibugs>	 (03CR) 10Dzahn: "this may have caused that you can't include the docker class on new hosts anymore without a puppet error. : https://gerrit.wikimedia.org/r" [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[19:52:02] <wikibugs>	 (03CR) 10Ahmon Dancy: k8s: Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[19:56:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:59:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (12) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2000).
[20:00:04] <jouncebot>	 jdlrobson and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:31] <Jdlrobson>	 here
[20:00:47] <kindrobot>	 I can deploy
[20:00:54] <Superpes>	 Hi :)
[20:01:11] <kindrobot>	 Jdlrobson: is it safe to deploy your two together?
[20:01:51] <Jdlrobson>	 kindrobot: yep
[20:01:55] <kindrobot>	 !log start UTC late backport window
[20:01:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:02:56] <wikibugs>	 (03PS1) 10Ahmon Dancy: k8s: Use storage-driver instead of storage_driver [puppet] - 10https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803)
[20:04:58] <kindrobot>	 Jdlrobson: what's modern-manpage?
[20:05:04] <wikibugs>	 (03CR) 10Ahmon Dancy: k8s: Force docker storage-driver to overlay2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902318 (https://phabricator.wikimedia.org/T332803) (owner: 10JMeybohm)
[20:05:53] <kindrobot>	 Oh, mainpage. I feel silly
[20:06:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) (owner: 10Jdlrobson)
[20:06:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: 10Jdlrobson)
[20:07:31] <wikibugs>	 (03Merged) 10jenkins-bot: Expand list of wikis with language button at top. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903322 (https://phabricator.wikimedia.org/T331777) (owner: 10Jdlrobson)
[20:08:50] <wikibugs>	 (03PS3) 10Stef Dunlap: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: 10Jdlrobson)
[20:09:03] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: 10Jdlrobson)
[20:09:47] <wikibugs>	 (03Merged) 10jenkins-bot: Enable web based viewing of ReadingLists on mediawiki.org and metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/902197 (https://phabricator.wikimedia.org/T322093) (owner: 10Jdlrobson)
[20:10:02] <logmsgbot>	 !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]]
[20:10:09] <stashbot>	 T322093: Enable viewing of reading lists on web - https://phabricator.wikimedia.org/T322093
[20:10:10] <stashbot>	 T331777: Add a header at the top of the Main page of French Wikiquote - https://phabricator.wikimedia.org/T331777
[20:11:29] <logmsgbot>	 !log kindrobot@deploy2002 jdlrobson and kindrobot: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[20:12:08] <kindrobot>	 Jdlrobson: ready to check
[20:13:18] <Jdlrobson>	 looking :)
[20:14:01] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1017.eqiad.wmnet
[20:14:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:14:33] <Jdlrobson>	 kindrobot: LGTM!
[20:15:12] <kindrobot>	 Great, syncing.
[20:15:53] <kindrobot>	 Superpes is it safe to deploy your two patches together?
[20:16:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:16:44] <Superpes>	 Yep no issue :) kindrobot
[20:17:12] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove puppet references to cloudvirt1017, 1021, and 1022 [puppet] - 10https://gerrit.wikimedia.org/r/903330 (https://phabricator.wikimedia.org/T333169)
[20:18:27] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove puppet references to cloudvirt1017, 1021, and 1022 [puppet] - 10https://gerrit.wikimedia.org/r/903330 (https://phabricator.wikimedia.org/T333169) (owner: 10Andrew Bogott)
[20:19:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:19:23] <wikibugs>	 (03CR) 10Subramanya Sastry: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: 10Subramanya Sastry)
[20:20:52] <logmsgbot>	 !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:903322|Expand list of wikis with language button at top. (T331777)]], [[gerrit:902197|Enable web based viewing of ReadingLists on mediawiki.org and metawiki (T322093)]] (duration: 10m 50s)
[20:21:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (28) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:02] <stashbot>	 T322093: Enable viewing of reading lists on web - https://phabricator.wikimedia.org/T322093
[20:21:02] <stashbot>	 T331777: Add a header at the top of the Main page of French Wikiquote - https://phabricator.wikimedia.org/T331777
[20:21:25] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[20:22:36] <kindrobot>	 Amir1: should we be worried about these systemd units failing before proceeding with the backports?
[20:23:20] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[20:24:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:25:01] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1017.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[20:25:01] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:25:02] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1017.eqiad.wmnet
[20:25:20] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1021.eqiad.wmnet
[20:26:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:27:15] <Amir1>	 kindrobot: which one is it?
[20:27:37] <Jdlrobson>	 thanks kindrobot  ! looking good on production!
[20:27:54] <Amir1>	 the alert2001 one, it should be fine for now
[20:28:36] <kindrobot>	 (SystemdUnitFailed) firing: (67) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 | (SystemdUnitFailed) firing: (14) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100
[20:28:57] <taavi>	 it's a new alert, I suspect it's actually been failing for longer
[20:29:07] <Amir1>	 there are way too many systemd unit failures, sigh https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:29:19] <Amir1>	 it's been ongoing for a while, it seems
[20:29:44] <Amir1>	 cwhite: maybe you know what's going on? especially on alert2001
[20:30:00] <kindrobot>	 So would you advise continuing with the backports?
[20:31:04] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[20:31:28] <taavi>	 kindrobot: I'd ignore the alerts and continue
[20:31:40] <kindrobot>	 OK, thank you. :)
[20:31:46] <Amir1>	 yeah, it's not related for sure
[20:32:51] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Assign insetup role to new aphlict vm [puppet] - 10https://gerrit.wikimedia.org/r/903264 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney)
[20:33:16] <wikibugs>	 (03PS4) 10Stef Dunlap: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15)
[20:33:16] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[20:34:31] <cwhite>	 Amir1: thanks for the heads up, I'll look into the auto restart failure
[20:35:13] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1021.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[20:35:13] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:35:14] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1021.eqiad.wmnet
[20:35:36] <wikibugs>	 (03PS2) 10Stef Dunlap: Disable VisualEditor from talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 (owner: 10Superpes15)
[20:35:42] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 (owner: 10Superpes15)
[20:35:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15)
[20:36:10] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1022.eqiad.wmnet
[20:37:32] <wikibugs>	 (03Merged) 10jenkins-bot: [sysop_itwiki] Add the logo also for vector 2022 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903323 (https://phabricator.wikimedia.org/T330279) (owner: 10Superpes15)
[20:41:26] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Very true. Sorry for causing trouble!" [puppet] - 10https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803) (owner: 10Ahmon Dancy)
[20:41:49] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[20:43:50] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[20:45:04] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1022.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[20:45:04] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:45:05] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1022.eqiad.wmnet
[20:45:19] <wikibugs>	 10ops-eqiad, 10cloud-services-team, 10decommission-hardware: decommission cloudvirt1017.eqiad.wmnet and cloudvirt102[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T333169 (10Andrew) a:05Andrew→03Jclark-ctr
[20:46:41] <wikibugs>	 (03PS1) 10Jgreen: payments-listener.frdev.wikimedia.org cname for new FR staging server [dns] - 10https://gerrit.wikimedia.org/r/903332 (https://phabricator.wikimedia.org/T285892)
[20:50:29] <Amir1>	 cwhite: it might be related, but we are getting a systemd unit failure on db1101 and the alert doesn't make sense
[20:50:47] <Amir1>	 as it's really not failing
[20:51:03] <Amir1>	 (maybe? I'll check)
[20:51:24] <cwhite>	 From the auto-restart timer?
[20:51:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) wmf_auto_restart_prometheus-icinga-am.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:52:08] <wikibugs>	 10SRE, 10Traffic, 10HTTPS: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (10BCornwall) Hey, @greg. It's not blocking overall improvement, it's just not [[ https://wikitech.wikimedia.org/wiki/HTTPS#Current_policies_and_standards | complying with standards ]]. Since s...
[20:52:36] <wikibugs>	 (03CR) 10Jgreen: [C: 03+2] payments-listener.frdev.wikimedia.org cname for new FR staging server [dns] - 10https://gerrit.wikimedia.org/r/903332 (https://phabricator.wikimedia.org/T285892) (owner: 10Jgreen)
[20:55:34] <cwhite>	 Amir1: db1101 is not an s7 host anymore?
[20:55:54] <Amir1>	 probably Manuel moved it but he is not around
[20:56:10] <Amir1>	 I think he said he reset the systemd timer
[20:57:42] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:57:44] <icinga-wm>	 PROBLEM - Host restbase1033 is DOWN: PING CRITICAL - Packet loss = 100%
[20:57:56] <cwhite>	 I'm guessing there are auto-restart timers lingering that aren't being cleaned up by puppet.
[20:57:58] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[20:59:36] <marostegui>	 cwhite: it's not an S7 host, no
[20:59:40] <marostegui>	 it's in M1 
[21:00:03] <marostegui>	 I disabled both systemd units
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2100).
[21:00:42] <kindrobot>	 Note: the backport deploy window is still in progress
[21:01:06] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:01:12] <kindrobot>	 taavi: it seems like it's stalled out. It's cleared CI, but it hasn't merged
[21:02:25] <kindrobot>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/903326/
[21:02:50] <icinga-wm>	 RECOVERY - Host restbase1033 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[21:03:46] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[21:04:38] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[21:07:22] <kindrobot>	 dancy: ^
[21:11:02] <tzatziki>	 !log moving Universal Code of Conduct/Enforcement guidelines -> Universal Code of Conduct/Enforcement guidelines/Version 1 on metawiki with `extensions/Translate/scripts/moveTranslatableBundle.php `
[21:11:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:17] <taavi>	 kindrobot: somehow the +2 was applied to PS1 while PS2 was the latest
[21:11:19] <tzatziki>	 (probably don't need to log that but just in case)
[21:11:25] <Superpes>	 Uh it doesn't want to merge it
[21:11:51] <Superpes>	 Oh
[21:12:17] <thcipriani>	 hrm
[21:12:33] <taavi>	 so just re-+2 it and probably file a bug in scap
[21:12:39] <kindrobot>	 It should probably be OK to scap backport again, eh?
[21:12:42] <kindrobot>	 OK.
[21:12:43] <wikibugs>	 (03CR) 10Andrea Denisse: [V: 03+1 C: 03+2] doc: Reserve UID/GID for the doc-uploader system user [puppet] - 10https://gerrit.wikimedia.org/r/903319 (https://phabricator.wikimedia.org/T319477) (owner: 10Andrea Denisse)
[21:12:48] <thcipriani>	 +1
[21:13:10] <kindrobot>	 Thank you all.
[21:13:17] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 (owner: 10Superpes15)
[21:14:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:14:12] <wikibugs>	 (03Merged) 10jenkins-bot: Disable VisualEditor from talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903326 (owner: 10Superpes15)
[21:14:25] <logmsgbot>	 !log kindrobot@deploy2002 Started scap: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]]
[21:14:31] <stashbot>	 T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279
[21:14:54] <logmsgbot>	 !log htriedman@deploy2002 Started deploy [airflow-dags/platform_eng@5f0eb44]: (no justification provided)
[21:15:08] <logmsgbot>	 !log htriedman@deploy2002 Finished deploy [airflow-dags/platform_eng@5f0eb44]: (no justification provided) (duration: 00m 13s)
[21:15:49] <logmsgbot>	 !log kindrobot@deploy2002 kindrobot and superpes: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[21:16:23] <kindrobot>	 Ready to check Superpes 
[21:17:02] <Superpes>	 Checked both and everything is fine kindrobot! Thanks! :)
[21:17:22] <kindrobot>	 Thanks, syncing
[21:18:28] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[21:18:47] <wikibugs>	 (03CR) 10Dduvall: buildkitd: Isolate build container user/process/network namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: 10Dduvall)
[21:19:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:22:31] <wikibugs>	 (03CR) 10Dduvall: buildkitd: Isolate build container user/process/network namespaces (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902132 (https://phabricator.wikimedia.org/T332804) (owner: 10Dduvall)
[21:22:51] <logmsgbot>	 !log kindrobot@deploy2002 Finished scap: Backport for [[gerrit:903326|Disable VisualEditor from talk namespace]], [[gerrit:903323|[sysop_itwiki] Add the logo also for vector 2022 (T330279)]] (duration: 08m 26s)
[21:22:57] <stashbot>	 T330279: Change logo, wordmark and favicon of sysop_itwiki - https://phabricator.wikimedia.org/T330279
[21:23:40] <kindrobot>	 Sync finished. Thanks everyone.
[21:23:52] <kindrobot>	 !log finish UTC late backports
[21:23:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:24:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:16] <jinxer-wm>	 (SystemdUnitFailed) firing: (30) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:24:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on wcqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wcqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[21:24:30] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[21:24:56] <Amir1>	 !log start of watchlist clean up in arwiki (T328501)
[21:24:59] <kindrobot>	 Reedy, sbassett, Maryum, and manfredi backport window finished :)
[21:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:01] <stashbot>	 T328501: Request to clean my watchlist from articles in namespace 0 and 1 - https://phabricator.wikimedia.org/T328501
[21:25:15] <Superpes>	 Thanks for your time kindrobot :D
[21:25:53] <kindrobot>	 No problem, thank you. :)
[21:26:16] <jinxer-wm>	 (SystemdUnitFailed) firing: (37) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:27:38] <icinga-wm>	 PROBLEM - Restbase root url on restbase1033 is CRITICAL: connect to address 10.64.48.71 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[21:29:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (37) wmf_auto_restart_apache2-htcacheclean.service Failed on arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:29:24] <icinga-wm>	 PROBLEM - SSH on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:30:56] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[21:35:17] <wikibugs>	 10SRE, 10vm-requests: Site: 1 VM request for doc2002 - https://phabricator.wikimedia.org/T332819 (10andrea.denisse)
[21:37:59] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Ladsgroup) superset should be automatically done via wmf ldap group. If Jgiannelos  is in the ldap group, it should be done already. Correct?
[21:39:35] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Ladsgroup) a:03Ladsgroup I'm on clinic duty this week. Waiting for signoff by Tyler. Maybe a deployment training can be arranged (or other devs in wmde can do an i...
[21:40:02] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10Ladsgroup) https://wikitech.wikimedia.org/wiki/Deployments/Training
[21:42:50] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (10taavi) > To be able to access deplyed Wiki instances and ensure that wikibase (namely wikibase client) is properly installed and configured Unless you're also planni...
[21:43:53] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Yiannis Giannelos - https://phabricator.wikimedia.org/T332063 (10Ottomata) Superset has its own 'roles', and I think something changed in a recent version that makes is so the default role doesn't have access to the SQL lab feat...
[21:45:34] <ryankemper>	 !log T330165 Depooled relevant search platform hosts: `sudo -E cumin 'elastic[1055-1056,1074-1079,1085-1086]*,cloudelastic100[2,6]*,wcqs1002*,wdqs[1007,1012]*' 'sudo depool'`
[21:45:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:40] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[21:49:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (13) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:56:01] <jinxer-wm>	 (SystemdUnitFailed) firing: (40) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:58:13] <urandom>	 !log power cycling restbase1033 — T333243
[21:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:18] <stashbot>	 T333243: restbase1033 is down - https://phabricator.wikimedia.org/T333243
[21:58:41] <maryum>	 !log Deploy security fix for T326952
[21:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:11] <jinxer-wm>	 (SystemdUnitFailed) firing: (63) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:01:08] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:01:16] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:01:16] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.64.48.151:7001 on restbase1033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:01:34] <icinga-wm>	 RECOVERY - SSH on restbase1033 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:01:38] <icinga-wm>	 RECOVERY - Restbase root url on restbase1033 is OK: HTTP OK: HTTP/1.1 200 - 17255 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[22:02:08] <icinga-wm>	 PROBLEM - cassandra-b service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:02:24] <icinga-wm>	 PROBLEM - cassandra-c service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:02:24] <icinga-wm>	 PROBLEM - cassandra-a service on restbase1033 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:04:04] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1033 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:04:18] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1033 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:04:18] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1033 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:04:46] <wikibugs>	 (03PS2) 10EoghanGaffney: Adds php and apache logs for doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245)
[22:04:56] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on wcqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wcqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[22:05:18] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:06:16] <wikibugs>	 (03CR) 10EoghanGaffney: Adds php and apache logs for doc machines (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney)
[22:06:54] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.48.152:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-b valid until 2024-08-28 11:43:21 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:07:00] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.48.153:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-c valid until 2024-08-28 11:43:23 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:07:01] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.151:7001 on restbase1033 is OK: SSL OK - Certificate restbase1033-a valid until 2024-08-28 11:43:18 +0000 (expires in 519 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:07:02] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.151:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.151 port 9042 https://phabricator.wikimedia.org/T93886
[22:07:04] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.48.152:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.152 port 9042 https://phabricator.wikimedia.org/T93886
[22:07:22] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.48.153:9042 on restbase1033 is OK: TCP OK - 0.000 second response time on 10.64.48.153 port 9042 https://phabricator.wikimedia.org/T93886
[22:09:50] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Volans) >>! In T330165#8731601, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://sal.toolforge.org/log/YxgIJY...
[22:10:17] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:14:43] <wikibugs>	 (03CR) 10Dzahn: "It's not true that this removes IRC notifications, they were just sent to a test channel only. I am fixing that here: https://gerrit.wikim" [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[22:16:08] <zabe>	 !log zabe@mwmaint2002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Meta:WMF Support and Safety" "Meta:WMF Trust and Safety" "Zabe" --reason "per [[:phab:T330514|T330514]]" # T330514
[22:16:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:15] <stashbot>	 T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
[22:17:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] alertmanager: send sre-collab alerts to -operations and -sre-collab [puppet] - 10https://gerrit.wikimedia.org/r/903318 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[22:18:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 287.6k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[22:19:47] <wikibugs>	 (03PS1) 10Zabe: Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514)
[22:21:42] <zabe>	 MessageIndexException from line 191 of /srv/mediawiki/php-1.41.0-wmf.1/extensions/Translate/utils/MessageIndex.php: MessageIndex: unable to acquire lock
[22:21:46] <zabe>	 :|
[22:22:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[22:22:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Please update bootp helper on pfw3-eqiad to point to frpm1002 for fundraising subnets - https://phabricator.wikimedia.org/T332939 (10Dwisehaupt) Thanks! Verified working and runs good.
[22:22:54] <herzog>	 zabe: I don't think we need to backport that to the current wmf branch, we can let the train do it when the time comes?
[22:23:34] <herzog>	 oh, well, you're moving the meta pages now - I wanted to wait a bit
[22:24:05] <herzog>	 well, we get this done now, good :)
[22:24:15] <zabe>	 is there anything specific you wanted to wait for?
[22:24:55] <mutante>	 runs puppet on bast1003 because that alert claims puppet fails on bastion "cluster" but also I don't get the graph :)
[22:27:02] <mutante>	 and nothing actually failed there.. so no idea
[22:28:08] <mutante>	 ah, it's bast5003 pushing things over the limit and the usual background ones https://puppetboard.wikimedia.org/nodes?status=failed
[22:29:15] <herzog>	 zabe: my idea was train -> watch for failures -> rename; but since you are backporting it now, I guess there's no need to wait :)
[22:30:54] <zabe>	 well :)
[22:31:10] <zabe>	 jouncebot: nowandnext
[22:31:10] <jouncebot>	 For the next 0 hour(s) and 28 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230327T2100)
[22:31:10] <jouncebot>	 In 3 hour(s) and 28 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0200)
[22:31:29] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) (owner: 10Zabe)
[22:42:06] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic1074-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:43:32] <mutante>	 !log apt2001 - kill 3105; run puppet
[22:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:45] <mutante>	 !log stat1004 - kill 29291; run puppet
[22:43:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:51] <wikibugs>	 (03Merged) 10jenkins-bot: Rename "Support and Safety" to "Trust and Safety" [extensions/WikimediaMessages] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903205 (https://phabricator.wikimedia.org/T330514) (owner: 10Zabe)
[22:47:06] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic1074-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:47:20] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]]
[22:47:26] <stashbot>	 T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
[22:48:34] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on bastion cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=bastion - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[22:48:48] <mutante>	 !log stat1005 - kill 18179; run puppet ; stat1007 - kill 3346; run puppet ; stat1006 - kill 23887; run puppet
[22:48:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:06] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (6) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[22:57:06] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (8) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:00:01] <logmsgbot>	 !log zabe@deploy2002 zabe: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[23:00:11] <stashbot>	 T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
[23:02:06] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (9) Elasticsearch instance elastic1056-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:02:11] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10Dzahn) We got the "widespread puppet failures" alert which made me look at some random failed hosts in the list. I found the reason was this offboarding, because:  apt2001:   ` Err...
[23:02:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:03:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:07:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:08:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (22) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:08:48] <logmsgbot>	 !log zabe@deploy2002 Finished scap: Backport for [[gerrit:903205|Rename "Support and Safety" to "Trust and Safety" (T330514)]] (duration: 21m 27s)
[23:08:54] <stashbot>	 T330514: Rename group-wmf-supportsafety contents - https://phabricator.wikimedia.org/T330514
[23:09:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[23:10:11] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] peopleweb: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[23:13:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (60) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:15:26] <wikibugs>	 (CR) Dzahn: [C: +2] "after https://gerrit.wikimedia.org/r/c/operations/puppet/+/903318 it's also not true anymore that this removes IRC notifications. they sho" [puppet] - https://gerrit.wikimedia.org/r/902785 (https://phabricator.wikimedia.org/T331901) (owner: Dzahn)
[23:17:29] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (colewhite)
[23:18:15] <wikibugs>	 (CR) Dzahn: [C: +2] etherpad: replace Icinga with Prometheus monitoring [puppet] - https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[23:21:49] <wikibugs>	 SRE, ops-eqiad, Data-Engineering: Degraded RAID on an-worker1132 - https://phabricator.wikimedia.org/T333091 (wiki_willy) a:Jclark-ctr
[23:22:06] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:22:35] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, Analytics-Radar, DC-Ops: Add-in Card 2 ROMB Battery LOW - https://phabricator.wikimedia.org/T332883 (wiki_willy) a:Jclark-ctr
[23:24:23] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (wiki_willy) a:Papaul
[23:24:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[23:24:39] <wikibugs>	 SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (Htriedman) @MoritzMuehlenhoff Sorry if this is a silly question, but I've been trying to run commands as `analytics-platform-eng` on stat machines by using `sudo -u analytics-platform-eng <cmd>...` and am b...
[23:25:24] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (wiki_willy) a:Jhancock.wm
[23:29:25] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (wiki_willy) Hi guys - can we confirm the firmware is all up to date?  Thanks, Willy
[23:31:12] <zabe>	 !log deployed patch for T330968
[23:31:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (58) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:38:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (63) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:42:22] <wikibugs>	 SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (Dzahn) Hi @Htriedman and @MoritzMuehlenhoff,   the answer to this riddle is that while the special user "`analytics-platform-eng`" exists on all stat* machines, the admin group `analytics-platform-eng-admin...
[23:44:30] <wikibugs>	 (CR) Dzahn: [C: +2] "working per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*people.*%22%7D&g0.tab=1&g0.stacked=0&g0.range_inpu" [puppet] - https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[23:44:42] <wikibugs>	 SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (Ottomata) > @Htriedman I think this comes down to a new access request like "add analytics-platform-eng-admins on stat* hosts".  Or ssh to an-airflow1004 and run your sudo cmd there :)
[23:47:08] <mutante>	 !log people1003 - taking down apache to provoke monitoring alert (inactive instances) and confirm IRC alerting change works
[23:47:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:50:06] <mutante>	 jinxer-wm: jinx it
[23:50:55] <jinxer-wm>	 (ProbeDown) firing: Service people1003:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:51:02] <icinga-wm>	 PROBLEM - people.wikimedia.org requires authentication on people1003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[23:51:23] <mutante>	 oh, well, that worked but the Icinga part isn't gone
[23:51:31] <mutante>	 it was supposed to replace that
[23:52:42] <icinga-wm>	 RECOVERY - people.wikimedia.org requires authentication on people1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 586 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[23:55:50] <jinxer-wm>	 (ProbeDown) resolved: Service people1003:443 has failed probes (http_people_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#people1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:59:07] <wikibugs>	 (CR) Dzahn: [C: +2] "confirmed this reports on IRC on both channels and also created a ticket, as desired" [puppet] - https://gerrit.wikimedia.org/r/902801 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)