[00:00:28] <TheresNoTime>	 Even the long prefixes will be shorter than the entire domain name... I think
[00:00:34] <mutante>	 we had ALL the .wiki domains at first
[00:00:40] <mutante>	 en.wiki de.wiki .. and so on
[00:00:52] <mutante>	 but that's still wikipedia centric :)
[00:01:35] <bd808>	 TheresNoTime: the language code interwikis are project context sensitive. [[en:foo]] on fr.wikisource.org routes to en.wikisource.org.
[00:02:18] <TheresNoTime>	 Ahh 
[00:02:24] <AntiComposite>	 wikipedia is the default though
[00:02:31] <bd808>	 there's a weird thing in that on commons or meta the 'en' interwiki goes to wikipedia, but that's mostly weirdness
[00:03:03] <perryprog>	 Makes sense for commons, considering most interwiki links are linking to -> thing in image, generally speaking.
[00:03:13] <TheresNoTime>	 Well you do `:w:en:` or something right?
[00:03:19] <TheresNoTime>	 I should know this
[00:03:36] <perryprog>	 Eh, every time I make an interwiki link I have to check and I still don't remember
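The prefix routing bd808 describes above (a language code resolves within the current project family, while commons and meta fall back to Wikipedia) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the host table, and the three-label host check are hypothetical and are not MediaWiki's actual interwiki implementation.

```python
# Hypothetical sketch of project-sensitive language-prefix resolution.
# The names below are illustrative assumptions, not MediaWiki's real
# interwiki code or configuration.

# Hosts where a bare language prefix falls back to Wikipedia, as described
# for commons and meta in the discussion above.
MULTILINGUAL_HOSTS = {"commons.wikimedia.org", "meta.wikimedia.org"}


def resolve_language_prefix(current_host: str, lang: str) -> str:
    """Return the host that [[lang:Title]] would target when used on current_host."""
    if current_host in MULTILINGUAL_HOSTS:
        # No per-language sibling project exists here, so default to Wikipedia.
        return f"{lang}.wikipedia.org"
    parts = current_host.split(".")
    if len(parts) != 3:
        raise ValueError(f"unexpected host: {current_host}")
    _, project, tld = parts
    # Keep the project family, swap only the language subdomain.
    return f"{lang}.{project}.{tld}"
```

Under these assumptions, calling `resolve_language_prefix("fr.wikisource.org", "en")` stays within Wikisource, whereas the same prefix used on commons lands on Wikipedia.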
[00:05:47] <mutante>	 perryprog: haha, yea, and then you might think "is it really easier than if I could paste the full URL and link [ ] instead of [[]] ?" :)
[00:06:06] <perryprog>	 noooo it has the little square thingy and a different visited link color
[00:06:11] <mutante>	 if you do someone will fix it for you either way though :) and tell you how you should do it right
[00:06:39] <TheresNoTime>	 Full URLs in log entries is the way /s
[00:07:43] <perryprog>	 I'm a firm believer that Enterprisey's diff-permalink.js script is possibly the best gadget ever written. (https://enwp.org/User:Enterprisey/diff-permalink.js)
[00:07:57] <perryprog>	 And that's only slight hyperbole.
[00:08:41] <TheresNoTime>	 Yet another of their scripts which should be in core 
[00:12:18] <legoktm>	 just paste the full URL into VisualEditor, it cleans it up for you and even interwikis if possible
[00:13:36] <legoktm>	 (2017 WTE does the same)
[00:15:02] <perryprog>	 what's visualeditor
[00:15:15] <perryprog>	 (also I don't see that happening in either)
[00:15:43] <perryprog>	 Oh I see, it does it sneakily
[00:16:13] <TheresNoTime>	 gah that'd be really useful if I used VE
[00:16:32] <perryprog>	 Though I don't see it happening in 2017 WTE (which I do use)
[00:17:21] <perryprog>	 Ohh it's just when you use the shortcut and with a different link text. I'll shut up now.
[00:25:05] <legoktm>	 yeah, you need to use the link insertion dialog
[00:31:32] <wikibugs>	 (PS1) Tim Starling: Unprovision the "swift" dashboard [puppet] - https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872)
[00:36:26] <wikibugs>	 (PS5) Aaron Schulz: DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968)
[00:49:20] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:38] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:34] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (Papaul) Physical Disk 0:2:19 Foreign 19 7451.5 GB SATA HDD No
[02:21:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:26] <wikibugs>	 SRE, Domains, Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (Legoktm) >>! In T332220#8700530, @violetwtf wrote: > I think this spans a bit outside of "literally everything" -- enwp.org is widely used by Wikipedia editors in Wikipedia-adjacent channels to refer to Wikipedia.  I...
[03:21:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[04:04:44] <icinga-wm>	 PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service,ceph-mon@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:04] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 3.758e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[04:56:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[05:09:58] <wikibugs>	 (PS1) Gergő Tisza: Leveling up: Backport recent changes [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900026 (https://phabricator.wikimedia.org/T322387)
[05:17:51] <wikibugs>	 (CR) Gergő Tisza: Leveling up: Backport recent changes (1 comment) [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900026 (https://phabricator.wikimedia.org/T322387) (owner: Gergő Tisza)
[05:23:06] <icinga-wm>	 RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 8821 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[05:57:23] <wikibugs>	 (PS1) Marostegui: Revert "db1176: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/899602
[05:57:47] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1176: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/899602 (owner: Marostegui)
[05:58:45] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Promote db1106 to m5 master" [puppet] - https://gerrit.wikimedia.org/r/899603
[05:58:51] <wikibugs>	 (PS2) Marostegui: Revert "mariadb: Promote db1106 to m5 master" [puppet] - https://gerrit.wikimedia.org/r/899603
[05:59:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: m5 master switch T332155
[05:59:33] <stashbot>	 T332155: Switchover m5 master (db1106 -> db1176) - https://phabricator.wikimedia.org/T332155
[05:59:55] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: m5 master switch T332155
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: gettimeofday() says it's time for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T0600)
[06:01:52] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "mariadb: Promote db1106 to m5 master" [puppet] - https://gerrit.wikimedia.org/r/899603 (owner: Marostegui)
[06:03:50] <marostegui>	 !log Failover m5 from db1106 to db1176 - T332155
[06:03:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:31] <wikibugs>	 (PS1) Marostegui: Revert "mariadb: Move db1106 to m5" [puppet] - https://gerrit.wikimedia.org/r/899604
[06:09:42] <wikibugs>	 (CR) CI reject: [V: -1] Revert "mariadb: Move db1106 to m5" [puppet] - https://gerrit.wikimedia.org/r/899604 (owner: Marostegui)
[06:13:41] <wikibugs>	 (Abandoned) Marostegui: Revert "mariadb: Move db1106 to m5" [puppet] - https://gerrit.wikimedia.org/r/899604 (owner: Marostegui)
[06:22:09] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove db1105 from dbctl [puppet] - https://gerrit.wikimedia.org/r/900051 (https://phabricator.wikimedia.org/T331874)
[06:22:34] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove db1105 from dbctl [puppet] - https://gerrit.wikimedia.org/r/900051 (https://phabricator.wikimedia.org/T331874) (owner: Marostegui)
[06:23:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1105 from dbctl T331874', diff saved to https://phabricator.wikimedia.org/P45883 and previous config saved to /var/cache/conftool/dbconfig/20230316-062307-root.json
[06:23:13] <stashbot>	 T331874: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874
[06:29:20] <wikibugs>	 (CR) Krinkle: "Thanks for restoring these! We had renamed these in the Prometheus clean up some months ago." [alerts] - https://gerrit.wikimedia.org/r/899506 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[06:59:02] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1106 to s1 [puppet] - https://gerrit.wikimedia.org/r/900086 (https://phabricator.wikimedia.org/T332270)
[06:59:48] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1106 to s1 [puppet] - https://gerrit.wikimedia.org/r/900086 (https://phabricator.wikimedia.org/T332270) (owner: Marostegui)
[07:00:05] <jouncebot>	 Amir1, apergos, and jnuche: OwO what's this, a deployment window?? UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T0700). nyaa~
[07:00:05] <jouncebot>	 tgr: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:09] <apergos>	 morning! there are no trainees signed up today, and one patch in the window. I note that it consists of multiple php files; bear in mind that scap does not guarantee the order that these files will be copied into place; are the changes in each file independent of each other, so that there will be no errors? tgr_ that question's for you. and I assume you will be self-deploying today?
[07:01:42] <wikibugs>	 (PS1) Marostegui: db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/900090 (https://phabricator.wikimedia.org/T332270)
[07:02:29] <wikibugs>	 (CR) Marostegui: [C: +2] db1106: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/900090 (https://phabricator.wikimedia.org/T332270) (owner: Marostegui)
[07:04:57] <tgr_>	 apergos: yeah. The code is only executed on group2 so should be fine.
[07:05:36] <apergos>	 it's not a matter of which group of wikis but rather any interdependency among the patched phph modules (if there is any)
[07:05:46] <apergos>	 *php
[07:06:20] <tgr_>	 well, if the code won't run, the interdependencies won't matter
[07:06:31] <tgr_>	 and .27 is not on group2 yet
[07:07:25] <apergos>	 ah so the branch isn't out to group2, got it
[07:07:29] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900026 (https://phabricator.wikimedia.org/T322387) (owner: Gergő Tisza)
[07:25:56] <wikibugs>	 (Merged) jenkins-bot: Leveling up: Backport recent changes [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900026 (https://phabricator.wikimedia.org/T322387) (owner: Gergő Tisza)
[07:26:25] <logmsgbot>	 !log tgr@deploy2002 Started scap: Backport for [[gerrit:900026|Leveling up: Backport recent changes]]
[07:28:05] <logmsgbot>	 !log tgr@deploy2002 tgr: Backport for [[gerrit:900026|Leveling up: Backport recent changes]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[07:31:25] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) I am removing the DBA tag from this task as there are no more databases pending here. I will remain subscribed in case I am needed.
[07:31:44] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM! Could you pcc just to be sure?" [puppet] - https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[07:31:45] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) Oh, nevermind, db1131 is still to be moved.
[07:32:04] <wikibugs>	 (CR) Elukey: [C: +1] k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[07:33:52] <tgr_>	 looks like the mediawiki-errors dashboard's chart isn't working
[07:33:55] <tgr_>	 not ideal for deployments
[07:34:05] <apergos>	 not too good is it
[07:34:38] <logmsgbot>	 !log tgr@deploy2002 Finished scap: Backport for [[gerrit:900026|Leveling up: Backport recent changes]] (duration: 08m 13s)
[07:34:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:36:01] <tgr_>	 !log UTC morning deploys done
[07:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:13] <apergos>	 ah it's all good?
[07:36:25] <apergos>	 I was just going to ask how testing was going on the production cluster
[07:36:55] <apergos>	 welp, never mind then, and see everyone next time!
[07:36:58] <tgr_>	 Well, I hope. I can't see the error trends. But in theory this deploy should have been a no-op.
[07:37:13] <tgr_>	 (Also not testable)
[07:37:29] <apergos>	 I looked at the logs from mwlog2002 and it's fine
[07:37:43] <tgr_>	 cool, thanks
[07:37:54] <apergos>	 I can keep that tab open for a few more minutes if there's a concern
[07:38:04] <apergos>	 please report the issue with the one dashboard, that's concerning
[07:39:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:41:23] <wikibugs>	 (CR) Elukey: [C: +1] k8s: Remove 1.16 related code [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[07:41:38] <wikibugs>	 (Abandoned) Elukey: services: add staging config for Lift Wing to the API gateway [deployment-charts] - https://gerrit.wikimedia.org/r/898741 (owner: Elukey)
[07:44:26] <tgr_>	 filed: https://phabricator.wikimedia.org/T332273
[07:45:02] <apergos>	 subscribed, thanks
[07:45:24] <apergos>	 still nothing new in the logs so I'm calling it good
[07:55:13] <wikibugs>	 (CR) Ayounsi: [C: +1] Move cloudsw prefix-list filters from templates to YAML [homer/public] - https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: Cathal Mooney)
[07:57:22] <wikibugs>	 SRE, Release-Engineering-Team, Wikimedia-Logstash: mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (kostajh)
[07:59:48] <kostajh>	 apergos: can I backport two more patches, or should I leave it for later?
[08:00:11] <wikibugs>	 (PS1) Kosta Harlan: Leveling up: always set wgGELevelingUpEnabledForUser [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/899605 (https://phabricator.wikimedia.org/T332227)
[08:00:27] <apergos>	 window is over, it's already past the hour, kostajh
[08:00:29] <wikibugs>	 (PS1) Kosta Harlan: SuggestedEditSession: Fix handling of post-save data refresh [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900126
[08:00:49] <apergos>	 so next window now (daylight confusion time might have got ya)
[08:01:13] <kostajh>	 I know it's just ending/ended, but if there's not anything next it might be ok to just do now
[08:02:32] <apergos>	 I don't know, the citoid window starts now, you'd need to talk to them
[08:03:38] <kostajh>	 it says it's in 2 hours? https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1000
[08:04:13] <apergos>	 https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_March_16 oh I see, I'm mixing up the timezones
[08:04:20] <apergos>	 well what do you have? are you a self deployer?
[08:04:30] <apergos>	 (sorry I always have to ask, I never remember...)
[08:04:31] <kostajh>	 yes, I can deploy myself
[08:05:18] <apergos>	 and which is the patch?
[08:06:12] <kostajh>	 apergos: just added to imedia.org/wiki/Deployments#deploycal-item-20230316T0700
[08:06:29] <wikibugs>	 SRE, Release-Engineering-Team, SRE Observability, Wikimedia-Logstash: mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (MoritzMuehlenhoff)
[08:06:54] <apergos>	 how long are these going to take to merge?
[08:07:01] <kostajh>	 ~20 minutes I imagine
[08:07:28] <apergos>	 ok go ahead (but next time try to get here for the window ;-) )
[08:07:40] <kostajh>	 thx
[08:08:11] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (MoritzMuehlenhoff)
[08:08:59] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900126 (owner: Kosta Harlan)
[08:09:06] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/899605 (https://phabricator.wikimedia.org/T332227) (owner: Kosta Harlan)
[08:10:12] <wikibugs>	 (PS1) Muehlenhoff: Add SSH key and Kerberos principal for ptiwary [puppet] - https://gerrit.wikimedia.org/r/900122 (https://phabricator.wikimedia.org/T332214)
[08:11:50] <apergos>	 !log additional deployments for the UTC morning backport and config training window, running into the next hour, so window re-opened
[08:11:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:59] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add SSH key and Kerberos principal for ptiwary [puppet] - https://gerrit.wikimedia.org/r/900122 (https://phabricator.wikimedia.org/T332214) (owner: Muehlenhoff)
[08:13:03] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ssh for jupyter notebooks for Prabhat - https://phabricator.wikimedia.org/T332214 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @prabhat I've enabled your access, but it will take up to 30 minutes until the change ha...
[08:27:19] <wikibugs>	 (Merged) jenkins-bot: SuggestedEditSession: Fix handling of post-save data refresh [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900126 (owner: Kosta Harlan)
[08:27:25] <wikibugs>	 (Merged) jenkins-bot: Leveling up: always set wgGELevelingUpEnabledForUser [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/899605 (https://phabricator.wikimedia.org/T332227) (owner: Kosta Harlan)
[08:27:38] <kostajh>	 apergos: verifying the patch on testwiki, then I'll sync
[08:27:42] <logmsgbot>	 !log kharlan@deploy2002 Started scap: Backport for [[gerrit:900126|SuggestedEditSession: Fix handling of post-save data refresh]], [[gerrit:899605|Leveling up: always set wgGELevelingUpEnabledForUser (T332227)]]
[08:27:48] <stashbot>	 T332227: Leveling up: Try new task panel does not display for edits made with source editor - https://phabricator.wikimedia.org/T332227
[08:29:15] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:900126|SuggestedEditSession: Fix handling of post-save data refresh]], [[gerrit:899605|Leveling up: always set wgGELevelingUpEnabledForUser (T332227)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[08:30:56] <apergos>	 I've got the logs pulled up and ready
[08:33:11] <kostajh>	 apergos: lgtm from the user side
[08:34:00] <apergos>	 a disclaimer, I have the production logs pulled up, sorry, I should have been clearer
[08:34:11] <kostajh>	 mwdebug-logs look ok
[08:34:13] <kostajh>	 I'm syncing
[08:34:20] <apergos>	 okey dokey
[08:37:01] <wikibugs>	 (PS1) Kosta Harlan: GrowthExperiments: Enable LevelingUp features on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/900196 (https://phabricator.wikimedia.org/T317813)
[08:37:27] <kostajh>	 apergos: can I also sneak ^ in? then I'm done, really 😅
[08:37:36] <apergos>	 uh
[08:37:42] <apergos>	 there's only 20 minutes left
[08:37:59] <kostajh>	 that one is just a config patch, IIRC that takes ~2-5 minutes
[08:38:13] <kostajh>	 but maybe it is a longer process depending on docker image building. I can leave it for later if you prefer.
[08:38:19] <apergos>	 ok (but if it doesn't, I get to whine about it in your ear mercilessly for a while :-P )
[08:38:33] <apergos>	 I'd rather you leave it for next time
[08:38:39] <kostajh>	 yeah I'll leave it
[08:38:39] <apergos>	 there's lots of second chances:-)
[08:39:53] <kostajh>	 ack
[08:40:13] <logmsgbot>	 !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:900126|SuggestedEditSession: Fix handling of post-save data refresh]], [[gerrit:899605|Leveling up: always set wgGELevelingUpEnabledForUser (T332227)]] (duration: 12m 30s)
[08:40:18] <stashbot>	 T332227: Leveling up: Try new task panel does not display for edits made with source editor - https://phabricator.wikimedia.org/T332227
[08:40:48] <kostajh>	 !log UTC morning deploys (second round) done
[08:40:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:03] <kostajh>	 thanks apergos, 'til next time...
[08:42:37] <apergos>	 see ya!
[08:43:45] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db1105 [puppet] - https://gerrit.wikimedia.org/r/900197 (https://phabricator.wikimedia.org/T331874)
[08:43:58] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1105.eqiad.wmnet
[08:48:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[08:48:45] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db1105 [puppet] - https://gerrit.wikimedia.org/r/900197 (https://phabricator.wikimedia.org/T331874) (owner: Marostegui)
[08:49:05] <wikibugs>	 SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (dcausse) Hi everyone and sorry to jump into this conversion but just wanted t...
[08:49:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1105.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[08:50:22] <wikibugs>	 ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (Marostegui) a:Marostegui→None
[08:51:02] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1105.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001"
[08:51:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:51:03] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1105.eqiad.wmnet
[08:51:10] <wikibugs>	 ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1105.eqiad.wmnet` - db1105.eqiad.wmnet (**WARN**)   - Downtimed...
[08:51:14] <wikibugs>	 ops-eqiad, DBA, decommission-hardware, Patch-For-Review: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (Marostegui) This is ready for DC-Ops
[08:51:27] <wikibugs>	 ops-eqiad, decommission-hardware, Patch-For-Review: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (Marostegui)
[09:00:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 234k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[09:06:31] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] perf: fix webperf metric names (1 comment) [alerts] - https://gerrit.wikimedia.org/r/899506 (https://phabricator.wikimedia.org/T309182) (owner: Filippo Giunchedi)
[09:20:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[09:26:29] <wikibugs>	 SRE, DBA, Data-Engineering, Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Marostegui)
[09:42:28] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:43:30] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[09:44:30] <wikibugs>	 (CR) Ayounsi: Management routers: move ssh port to 2222 (2 comments) [homer/public] - https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: Ayounsi)
[09:46:06] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:55:11] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops: Find a sensible way to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Joe) After some thought, I think the most maintainable way to do this is to add an additional lua module to the maps for api/appservers.  Specifically, this would...
[10:00:05] <jouncebot>	 mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1000)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1000)
[10:03:57] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[10:09:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179 to move it to x1', diff saved to https://phabricator.wikimedia.org/P45885 and previous config saved to /var/cache/conftool/dbconfig/20230316-100945-root.json
[10:09:48] <wikibugs>	 SRE, MW-on-K8s, Traffic, Wikidata, and 2 others: Migrate testwikidata to Kubernetes - https://phabricator.wikimedia.org/T331268 (Clement_Goubert)
[10:10:00] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[10:10:32] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM!  In theory the cloudlb (and cloudgw), would not need the static route, as they will learn it through BGP from the cloudsw.  The clou" [puppet] - https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[10:12:32] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops-radar, Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (Joe) I am not sure what would be the goal of checking the dns recursors in that situation, as far as running the cookbo...
[10:15:04] <wikibugs>	 SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (gmodena) >>! In T330693#8698639, @MatthewVernon wrote: > This is a k8s applic...
[10:15:30] <wikibugs>	 (CR) Effie Mouzeli: [C: +1] Assign mediawiki roles to mw2420-mw2451 [puppet] - https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: Clément Goubert)
[10:25:00] <wikibugs>	 SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (gmodena) Hi Eric,  > I know; I didn't mean for this to come across as an indi...
[10:25:50] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Invesitgate requirement for 'session-mode auatomatic' on EVPN iBGP peerings - https://phabricator.wikimedia.org/T332295 (cmooney) p:Triage→Low
[10:26:27] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:28:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[10:28:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:29:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:30:14] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[10:31:12] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:31:18] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[10:31:48] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:31:53] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[10:31:57] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw
[10:32:13] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw
[10:32:41] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: new_install
[10:32:54] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:33:03] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: new_install
[10:33:12] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c5ba1cf2-f027-43f9-8672-b4eb30f98ddc) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services w...
[10:33:15] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] Assign mediawiki roles to mw2420-mw2451 [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: 10Clément Goubert)
[10:33:38] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[10:33:47] <wikibugs>	 (03PS1) 10Filippo Giunchedi: DNM: test alertmanager depool for prometheus1006 [puppet] - 10https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T331449)
[10:33:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:34:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:36:34] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: allow scale-to-zero for staging deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/900239 (https://phabricator.wikimedia.org/T325763)
[10:37:20] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: ml-services: allow scale-to-zero for staging deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/900239 (https://phabricator.wikimedia.org/T325763)
[10:37:45] <wikibugs>	 (03PS1) 10Elukey: admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240
[10:37:52] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin
[10:38:04] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin
[10:38:54] <wikibugs>	 (03PS2) 10Elukey: admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240
[10:39:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:39:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:40:28] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:40:37] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[10:40:56] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Move cloudsw prefix-list filters from templates to YAML [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[10:42:03] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:42:19] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[10:44:13] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:44:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240 (owner: 10Elukey)
[10:45:52] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] ml-services: allow scale-to-zero for staging deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/900239 (https://phabricator.wikimedia.org/T325763) (owner: 10Ilias Sarantopoulos)
[10:46:09] <hashar>	 I might have broken Jenkins somehow
[10:46:16] <wikibugs>	 (03Merged) 10jenkins-bot: Move cloudsw prefix-list filters from templates to YAML [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[10:47:46] <hashar>	 well maybe not :]
[10:48:18] <wikibugs>	 (03PS6) 10Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934)
[10:48:20] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01271 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[10:48:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add automation for EVPN BGP peerings [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney)
[10:49:26] <wikibugs>	 (03PS4) 10Jameel Kaisar: Create and deploy per-CDN-site DNS domains [dns] - 10https://gerrit.wikimedia.org/r/899214
[10:49:36] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_stockpile_queue.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:50:00] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw
[10:51:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 204.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[10:51:28] <icinga-wm>	 RECOVERY - Check systemd state on puppetdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:52:27] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw
[10:55:00] <wikibugs>	 (03PS1) 10Filippo Giunchedi: traffic: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/900241 (https://phabricator.wikimedia.org/T309182)
[10:55:51] <wikibugs>	 (03PS1) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[10:56:21] <wikibugs>	 (03PS7) 10Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934)
[10:56:31] <wikibugs>	 (03PS10) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[10:56:33] <wikibugs>	 (03PS3) 10Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859)
[10:56:45] <wikibugs>	 (03CR) 10Cathal Mooney: Add automation for EVPN BGP peerings (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney)
[10:57:31] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Note that this might uncover some problems with the alerts once deployed, I'll followup with further patches to fix the alerts" [alerts] - 10https://gerrit.wikimedia.org/r/900241 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[10:58:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond)
[10:59:02] <wikibugs>	 (03PS2) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[11:00:18] <wikibugs>	 (03PS11) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[11:00:20] <wikibugs>	 (03PS4) 10Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859)
[11:01:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond)
[11:01:09] <wikibugs>	 (03PS3) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[11:03:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond)
[11:03:22] <wikibugs>	 (03PS4) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[11:04:22] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=4; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:04:23] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] traffic: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/900241 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[11:05:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond)
[11:06:01] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs
[11:06:07] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs
[11:06:37] <robh>	 jclark-ctr: cable order for ya will ship from California on 4/11
[11:07:24] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin
[11:07:48] <vgutierrez>	 robh: you just messed up my internal clock by chatting this early in the day :P
[11:08:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] cloudlb: introduce cloud-private IP address [puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[11:09:01] <wikibugs>	 (03CR) 10JMeybohm: "Giuseppe added some functionality to check/verify the state of discovery DNS, maybe that could be of use here as well:" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond)
[11:09:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[11:10:17] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=2; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:10:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] traffic: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/900241 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi)
[11:11:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[11:16:33] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 32 hosts with reason: new_install
[11:16:56] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 32 hosts with reason: new_install
[11:17:02] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=17f33514-0b87-4f50-abfa-6cd2e1548410) set by cgoubert@cumin1001 for 5:00:00 on 32 host(s) and their services with reason: new_instal...
[11:20:00] <wikibugs>	 (03PS5) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[11:23:21] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10kostajh) >>! In T331138#8698560, @Xover wrote: >>>! In T331138#8676812, @thcipriani wrote: >> I checked [[ https://www.mediawiki.org/wiki/Developers/Maintainers...
[11:24:23] <robh>	 bleh, wrong channel cuz it's too early heh
[11:24:27] <wikibugs>	 (03PS1) 10Clément Goubert: scap: Fix bootstrap-scap-target.sh exec [puppet] - 10https://gerrit.wikimedia.org/r/900252
[11:27:12] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=3; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:29:07] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs
[11:29:26] <wikibugs>	 (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240 (owner: 10Elukey)
[11:30:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) >>! In T327919#8699523, @cmooney wrote: >  > In terms of the move we need to work with @aborr...
[11:30:55] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs
[11:32:06] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams
[11:32:10] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin
[11:32:16] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams
[11:36:08] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] scap: Fix bootstrap-scap-target.sh exec [puppet] - 10https://gerrit.wikimedia.org/r/900252 (owner: 10Clément Goubert)
[11:36:14] <wikibugs>	 (03CR) 10Jbond: sre.{ganeti,hosts}.reimage: Confirm with hostname (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall)
[11:37:27] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=4; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:39:03] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: don't use CIDR for gateway parameter in static route [puppet] - 10https://gerrit.wikimedia.org/r/900262 (https://phabricator.wikimedia.org/T324992)
[11:41:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cloud_private_subnet: don't use CIDR for gateway parameter in static route [puppet] - 10https://gerrit.wikimedia.org/r/900262 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[11:42:16] <wikibugs>	 (03PS1) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/homer/public [homer/public] - 10https://gerrit.wikimedia.org/r/900263
[11:42:18] <wikibugs>	 (03PS1) 10Cathal Mooney: Add protocol direct to Cloud_outfilter protocols [homer/public] - 10https://gerrit.wikimedia.org/r/900264 (https://phabricator.wikimedia.org/T327919)
[11:43:10] <wikibugs>	 (03Abandoned) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/homer/public [homer/public] - 10https://gerrit.wikimedia.org/r/900263 (owner: 10Cathal Mooney)
[11:43:17] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[11:43:29] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add protocol direct to Cloud_outfilter protocols [homer/public] - 10https://gerrit.wikimedia.org/r/900264 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney)
[11:47:43] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: don't use CIDR for gateway parameter in static route [puppet] - 10https://gerrit.wikimedia.org/r/900262 (https://phabricator.wikimedia.org/T324992)
[11:52:03] <wikibugs>	 (03CR) 10Volans: Management routers: move ssh port to 2222 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi)
[11:54:19] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams
[11:54:47] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002937 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[11:56:24] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams
[11:56:25] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: don't use CIDR for gateway parameter in static route [puppet] - 10https://gerrit.wikimedia.org/r/900262 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[11:56:37] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad
[11:56:45] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad
[11:57:45] <wikibugs>	 (03PS2) 10Cathal Mooney: Add protocol direct to Cloud_outfilter protocols [homer/public] - 10https://gerrit.wikimedia.org/r/900264 (https://phabricator.wikimedia.org/T327919)
[11:58:39] <wikibugs>	 (03CR) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[12:04:25] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:04:53] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:05:54] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: new_install
[12:08:44] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: new_install
[12:08:50] <wikibugs>	 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f7f64d19-c64a-4fb5-a8ab-f3218dfd9862) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_instal...
[12:09:26] <wikibugs>	 (03PS1) 10Slyngshede: Password reset - Allow users to request a password reset. [software/bitu] - 10https://gerrit.wikimedia.org/r/900277
[12:12:03] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 13 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:14:17] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad
[12:14:25] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:16:20] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad
[12:24:21] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm just a typo and a minor optional nit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi)
[12:25:11] <wikibugs>	 (03CR) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison)
[12:30:56] <wikibugs>	 (03PS1) 10Slyngshede: get_single_object - get modified timestamp [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/900304
[12:31:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] get_single_object - get modified timestamp [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/900304 (owner: 10Slyngshede)
[12:32:08] <wikibugs>	 (03PS2) 10Slyngshede: get_single_object - get modified timestamp [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/900304
[12:34:19] <wikibugs>	 (03PS13) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505)
[12:38:06] <wikibugs>	 (03CR) 10Ssingh: sre.{ganeti,hosts}.reimage: Confirm with hostname (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1300)
[13:00:05] <jouncebot>	 xSavitar and raynor: A patch you scheduled for Mobileapps/RESTBase/Wikifeeds is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1300).
[13:00:05] <jouncebot>	 kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] <kostajh>	 hello
[13:00:37] <TheresNoTime>	 (I can deploy in 5m)
[13:00:57] <kostajh>	 I don't mind deploying myself
[13:01:04] <wikibugs>	 (03PS5) 10JMeybohm: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291)
[13:01:06] <wikibugs>	 (03PS3) 10JMeybohm: k8s: Remove 1.16 related code [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291)
[13:01:12] <TheresNoTime>	 kostajh: go ahead! :)
[13:02:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900196 (https://phabricator.wikimedia.org/T317813) (owner: 10Kosta Harlan)
[13:02:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 (owner: 10Kosta Harlan)
[13:02:54] <wikibugs>	 (03PS2) 10Kosta Harlan: GrowthExperiments: Remove unused GENewImpactD3Enabled flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593
[13:03:02] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900196 (https://phabricator.wikimedia.org/T317813) (owner: 10Kosta Harlan)
[13:03:04] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 (owner: 10Kosta Harlan)
[13:03:58] <claime>	 TheresNoTime: kostajh If you get issues with mw24[20-52] that's my fault, they're being commissioned. They shouldn't cause issues because they're not pooled but I'd rather give you a heads up.
[13:03:59] <wikibugs>	 (03Merged) 10jenkins-bot: GrowthExperiments: Enable LevelingUp features on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900196 (https://phabricator.wikimedia.org/T317813) (owner: 10Kosta Harlan)
[13:04:41] <kostajh>	 claime: thanks for letting me know
[13:05:39] <logmsgbot>	 !log kharlan@deploy2002 Started scap: Backport for [[gerrit:900196|GrowthExperiments: Enable LevelingUp features on testwiki (T317813)]]
[13:05:45] <stashbot>	 T317813: [EPIC] Positive Reinforcement: Leveling Up  - https://phabricator.wikimedia.org/T317813
[13:07:14] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:900196|GrowthExperiments: Enable LevelingUp features on testwiki (T317813)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[13:12:20] <wikibugs>	 (03CR) 10JMeybohm: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[13:14:45] <wikibugs>	 (03PS6) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291)
[13:14:48] <wikibugs>	 (03PS6) 10JMeybohm: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291)
[13:14:49] <wikibugs>	 (03PS4) 10JMeybohm: k8s: Remove 1.16 related code [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291)
[13:15:28] <logmsgbot>	 !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:900196|GrowthExperiments: Enable LevelingUp features on testwiki (T317813)]] (duration: 09m 48s)
[13:15:33] <stashbot>	 T317813: [EPIC] Positive Reinforcement: Leveling Up  - https://phabricator.wikimedia.org/T317813
[13:15:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:17:23] <wikibugs>	 (03Abandoned) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) (owner: 10Jbond)
[13:19:48] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 16 DIFF 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40163/console" [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[13:20:24] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 6 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40164/console" [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[13:20:34] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40165/console" [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[13:20:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:22:24] <wikibugs>	 (PS3) Kosta Harlan: GrowthExperiments: Remove unused GENewImpactD3Enabled flag [mediawiki-config] - https://gerrit.wikimedia.org/r/894593
[13:22:34] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/894593 (owner: Kosta Harlan)
[13:23:30] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (fgiunchedi)
[13:23:44] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] wmflib::service::probe::module_options: add tests [puppet] - https://gerrit.wikimedia.org/r/899643 (owner: Jbond)
[13:23:50] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/899643 (owner: Jbond)
[13:26:34] <wikibugs>	 (CR) JMeybohm: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[13:26:36] <wikibugs>	 (Merged) jenkins-bot: GrowthExperiments: Remove unused GENewImpactD3Enabled flag [mediawiki-config] - https://gerrit.wikimedia.org/r/894593 (owner: Kosta Harlan)
[13:27:01] <logmsgbot>	 !log kharlan@deploy2002 Started scap: Backport for [[gerrit:894593|GrowthExperiments: Remove unused GENewImpactD3Enabled flag]]
[13:27:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (Jclark-ctr) gerrit1003 B5 U13  port 4 Cableid 2988
[13:27:57] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (Jclark-ctr)
[13:28:35] <logmsgbot>	 !log kharlan@deploy2002 kharlan: Backport for [[gerrit:894593|GrowthExperiments: Remove unused GENewImpactD3Enabled flag]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[13:28:36] <wikibugs>	 SRE, SRE Observability, User-fgiunchedi: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (fgiunchedi)
[13:28:58] <wikibugs>	 (PS1) Jaime Nuche: docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900312
[13:29:00] <wikibugs>	 (PS1) Jaime Nuche: deployment_server: clean up older images using systemd timer [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678)
[13:30:31] <wikibugs>	 (CR) David Caro: [C: +1] "Note that this will need changing the secrets in the puppetmasters to adapt to the new hiera key" [puppet] - https://gerrit.wikimedia.org/r/899724 (https://phabricator.wikimedia.org/T316323) (owner: FNegri)
[13:32:11] <wikibugs>	 (CR) David Caro: [C: +1] openstack: create openstack-ansible evaluation role [puppet] - https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) (owner: Arturo Borrero Gonzalez)
[13:33:31] <wikibugs>	 (Abandoned) Arturo Borrero Gonzalez: openstack: create openstack-ansible evaluation role [puppet] - https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) (owner: Arturo Borrero Gonzalez)
[13:34:45] <logmsgbot>	 !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:894593|GrowthExperiments: Remove unused GENewImpactD3Enabled flag]] (duration: 07m 44s)
[13:35:04] <kostajh>	 !log UTC afternoon deploys done
[13:35:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:17] <wikibugs>	 (PS1) Jbond: netbox: fix minor lint issues and add test [puppet] - https://gerrit.wikimedia.org/r/900315
[13:35:19] <wikibugs>	 (PS1) Jbond: Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590)
[13:35:21] <wikibugs>	 (PS1) Jbond: netbox: add validators to canary host [puppet] - https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590)
[13:35:23] <wikibugs>	 (PS1) Jbond: netbox: add validators to production host [puppet] - https://gerrit.wikimedia.org/r/900318 (https://phabricator.wikimedia.org/T310590)
[13:35:43] <wikibugs>	 (CR) Jbond: "i think it would be better to control this via puppet so that we can for instance have different (new) validators on netbox-next vs netbox" [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[13:36:11] <wikibugs>	 (CR) CI reject: [V: -1] netbox: fix minor lint issues and add test [puppet] - https://gerrit.wikimedia.org/r/900315 (owner: Jbond)
[13:36:18] <wikibugs>	 (CR) CI reject: [V: -1] Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[13:38:04] <wikibugs>	 (CR) Jbond: [C: +2] wmflib::service::probe::module_options: add tests [puppet] - https://gerrit.wikimedia.org/r/899643 (owner: Jbond)
[13:43:58] <wikibugs>	 (PS5) Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080)
[13:45:06] <wikibugs>	 (CR) Ayounsi: [C: +1] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[13:46:47] <wikibugs>	 (CR) Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure (4 comments) [puppet] - https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[13:46:49] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[13:51:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] wmcs: update ceph alerts dashboard [alerts] - https://gerrit.wikimedia.org/r/895220 (owner: David Caro)
[13:51:58] <wikibugs>	 (PS3) Elukey: admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900240
[13:52:05] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: BFD Status Check Fails when device is unavailable - https://phabricator.wikimedia.org/T332080 (cmooney) Open→Resolved a:cmooney
[13:55:07] <wikibugs>	 (CR) CI reject: [V: -1] admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900240 (owner: Elukey)
[13:55:29] <wikibugs>	 (CR) Ayounsi: "Overall LGTM." [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[13:55:46] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: SNMP Network Checks throw exception when device is unreachable - https://phabricator.wikimedia.org/T332080 (cmooney)
[13:56:12] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: SNMP Network Checks throw exception when device is unreachable - https://phabricator.wikimedia.org/T332080 (cmooney) Resolved→Open
[13:56:36] <wikibugs>	 (CR) Cathal Mooney: Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured (1 comment) [puppet] - https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T315053) (owner: Cathal Mooney)
[13:56:39] <wikibugs>	 (PS1) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323
[13:56:48] <wikibugs>	 (PS3) Cathal Mooney: Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured [puppet] - https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T315053)
[13:56:58] <wikibugs>	 (PS1) Ssingh: auditd: remove obsolete buster code [puppet] - https://gerrit.wikimedia.org/r/900324 (https://phabricator.wikimedia.org/T321309)
[13:57:56] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40168/console" [puppet] - https://gerrit.wikimedia.org/r/900324 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[13:58:54] <wikibugs>	 (CR) CI reject: [V: -1] rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323 (owner: Andrew Bogott)
[13:59:15] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured [puppet] - https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T315053) (owner: Cathal Mooney)
[13:59:35] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] auditd: remove obsolete buster code [puppet] - https://gerrit.wikimedia.org/r/900324 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[14:01:03] <wikibugs>	 SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Eevans) >>! In T330693#8701662, @gmodena wrote: >>> [ ... ] >>>  >>> How woul...
[14:01:27] <wikibugs>	 (PS2) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323
[14:06:05] <urandom>	 !log ALTER-ing image_suggestions.suggestion table — T328670
[14:06:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:11] <stashbot>	 T328670: Add section title column to image_suggestions.suggestions table schema - https://phabricator.wikimedia.org/T328670
[14:07:36] <wikibugs>	 (CR) David Caro: rbd2backy2: log 'expire' string before trying to parse it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900323 (owner: Andrew Bogott)
[14:08:01] <wikibugs>	 (PS3) Ssingh: dnsrecursor: drop support for buster and pdns-recursor < 4.6 [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083)
[14:09:50] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40169/console" [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: Ssingh)
[14:13:21] <wikibugs>	 (CR) CDanis: [C: +1] Create and deploy per-CDN-site DNS domains [dns] - https://gerrit.wikimedia.org/r/899214 (owner: Jameel Kaisar)
[14:13:59] <wikibugs>	 (CR) Ssingh: [V: +1] dnsrecursor: drop support for buster and pdns-recursor < 4.6 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: Ssingh)
[14:17:23] <wikibugs>	 (PS4) Elukey: admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900240
[14:20:40] <wikibugs>	 (CR) Raymond Ndibe: [C: +1] "LGTM" [software/tools-webservice] - https://gerrit.wikimedia.org/r/897830 (owner: Slavina Stefanova)
[14:20:59] <wikibugs>	 (CR) Raymond Ndibe: [C: +2] d/changelog: prepare for 0.92 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/897830 (owner: Slavina Stefanova)
[14:21:22] <wikibugs>	 (PS8) Ayounsi: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: Cathal Mooney)
[14:22:36] <wikibugs>	 (Merged) jenkins-bot: d/changelog: prepare for 0.92 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/897830 (owner: Slavina Stefanova)
[14:22:37] <wikibugs>	 (CR) Ayounsi: [C: +1] "ship it!" [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: Cathal Mooney)
[14:26:24] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900240 (owner: Elukey)
[14:28:03] <wikibugs>	 (PS2) Jbond: netbox: fix minor lint issues and add test [puppet] - https://gerrit.wikimedia.org/r/900315
[14:30:31] <wikibugs>	 (PS1) Esanders: Enable DiscussionTools_visualenhancements_newsectionlink_enable on labs for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/900331
[14:31:12] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:31:16] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:37:08] <wikibugs>	 (CR) Jbond: [C: +2] netbox: fix minor lint issues and add test [puppet] - https://gerrit.wikimedia.org/r/900315 (owner: Jbond)
[14:37:26] <wikibugs>	 (PS2) Jbond: Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590)
[14:38:01] <wikibugs>	 (CR) CI reject: [V: -1] Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[14:39:54] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (MatthewVernon) Open→Resolved @Papaul thanks; the other drive has behaved itself since the reboot, so I think we're OK to leave it in place for now.  [obviously it...
[14:40:44] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:40:48] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:40:50] <logmsgbot>	 !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:40:57] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:42:26] <wikibugs>	 (CR) Ayounsi: Netbox: activate validators (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[14:44:35] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:47:48] <wikibugs>	 (PS3) Jbond: Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590)
[14:49:19] <wikibugs>	 (PS1) Herron: kafka-logging: stop kafka services on kafka-logging1001 [puppet] - https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419)
[14:49:22] <wikibugs>	 (PS1) Herron: kafka-logging: bring up kafka-logging1004 with node id 1004 [puppet] - https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419)
[14:49:38] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: new_install
[14:50:01] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: new_install
[14:50:06] <wikibugs>	 SRE, serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=33992616-b446-4bc5-bf17-27cb8c47e8d7) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_instal...
[14:50:18] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @cmooney thank you for getting the table ready for the cloud nodes move.  As you can see on asw...
[14:50:49] <wikibugs>	 (CR) Ahmon Dancy: "This looks like a reasonable and simple way to go." [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[14:51:19] <wikibugs>	 (CR) Ayounsi: Add validator classes for some objects (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[14:51:35] <wikibugs>	 (PS6) Ayounsi: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[14:51:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: SNMP Network Checks throw exception when device is unreachable - https://phabricator.wikimedia.org/T332080 (cmooney) Looks like our OSPF check already handles any exceptions: ` cmooney@alert1001:~$ ./check_ospf.py --hos...
[14:54:07] <wikibugs>	 (CR) Ahmon Dancy: "This change is ready for review." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[14:55:39] <wikibugs>	 (CR) Herron: "the new kafka-logging hosts are a good opportunity to re-align hostnames with node ids where they overlap.  my high level plan is to set d" [puppet] - https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419) (owner: Herron)
[14:57:06] <wikibugs>	 (PS6) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[14:57:08] <wikibugs>	 (PS1) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340
[14:57:33] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:38] <wikibugs>	 SRE, Observability-Logging, Release-Engineering-Team, Wikimedia-Logstash, SRE Observability (FY2022/2023-Q3): mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (colewhite) The `MediaWiki errors over time by channel` visualization...
[14:59:19] <wikibugs>	 (CR) CI reject: [V: -1] SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340 (owner: Jbond)
[14:59:30] <wikibugs>	 (CR) CI reject: [V: -1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:02:00] <wikibugs>	 (CR) Jaime Nuche: deployment_server: clean up older images using systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[15:02:12] <wikibugs>	 (CR) Ahmon Dancy: deployment_server: clean up older images using systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[15:02:51] <wikibugs>	 (CR) Jbond: "thanks updated see inline re ensurable_shell" [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:05:07] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:45] <wikibugs>	 (PS2) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340
[15:05:47] <wikibugs>	 (PS7) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:05:58] <wikibugs>	 (PS5) JMeybohm: k8s: Remove 1.16 related code [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291)
[15:07:23] <wikibugs>	 (PS3) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340
[15:08:09] <wikibugs>	 (PS2) Jbond: netbox: add validators to canary host [puppet] - https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590)
[15:08:27] <wikibugs>	 (PS2) Jbond: netbox: add validators to production host [puppet] - https://gerrit.wikimedia.org/r/900318 (https://phabricator.wikimedia.org/T310590)
[15:09:25] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for 32 hosts
[15:09:27] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40171/console" [puppet] - https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:09:35] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 32 hosts
[15:10:06] <wikibugs>	 (CR) CDanis: [C: +1] Create and deploy per-CDN-site DNS domains (1 comment) [dns] - https://gerrit.wikimedia.org/r/899214 (owner: Jameel Kaisar)
[15:10:09] <sukhe>	 !log disable puppet on R:class = dnsrecursor to merge CR: 898957
[15:10:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:13] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40170/console" [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:10:19] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40172/console" [puppet] - https://gerrit.wikimedia.org/r/900318 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:11:36] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/weight=30; selector: name=mw24[2345].*.codfw.wmnet,cluster=appserver
[15:11:55] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] dnsrecursor: drop support for buster and pdns-recursor < 4.6 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: Ssingh)
[15:11:59] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/weight=30; selector: name=mw24[2345].*.codfw.wmnet,cluster=api_appserver
[15:12:22] <sukhe>	 jbond: ok to merge yours?
[15:12:27] <sukhe>	 Jbond: netbox: fix minor lint issues and add test (7b3e6603b3)
[15:12:38] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/weight=25; selector: name=mw24[2345].*.codfw.wmnet,cluster=jobrunner
[15:13:14] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/weight=25; selector: name=mw24[2345].*.codfw.wmnet,cluster=videoscaler
[15:13:43] <claime>	 jouncebot: nowandnext
[15:13:43] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[15:13:43] <jouncebot>	 In 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1600)
[15:14:11] <wikibugs>	 (PS8) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:14:16] <wikibugs>	 (CR) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:14:36] <wikibugs>	 (PS9) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:15:15] <claime>	 !log Pooling new mw hosts mw24[20-51].codfw.wmnet - T326363
[15:15:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:20] <stashbot>	 T326363: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363
[15:15:50] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw24[2345].*.codfw.wmnet,cluster=appserver
[15:16:47] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:43] <wikibugs>	 (CR) Volans: [C: +1] "LGTM although I'm not sure if we want to enable this behaviour." [cookbooks] - https://gerrit.wikimedia.org/r/900340 (owner: Jbond)
[15:18:37] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2422 is CRITICAL: Host mw2422 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:37] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2423 is CRITICAL: Host mw2423 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:37] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2424 is CRITICAL: Host mw2424 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:37] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2426 is CRITICAL: Host mw2426 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:37] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2427 is CRITICAL: Host mw2427 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:38] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2428 is CRITICAL: Host mw2428 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:38] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2429 is CRITICAL: Host mw2429 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:39] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2430 is CRITICAL: Host mw2430 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:39] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2434 is CRITICAL: Host mw2434 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:40] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2435 is CRITICAL: Host mw2435 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:40] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2436 is CRITICAL: Host mw2436 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:41] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2437 is CRITICAL: Host mw2437 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:41] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2440 is CRITICAL: Host mw2440 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:42] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2442 is CRITICAL: Host mw2442 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:42] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2443 is CRITICAL: Host mw2443 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:42] <claime>	 Expected.
[15:18:43] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2444 is CRITICAL: Host mw2444 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:43] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2445 is CRITICAL: Host mw2445 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:44] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2446 is CRITICAL: Host mw2446 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:44] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2450 is CRITICAL: Host mw2450 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:45] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2451 is CRITICAL: Host mw2451 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:46] <sukhe>	 claime: phew :P
[15:19:50] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw24[2345].*.codfw.wmnet,cluster=api_appserver
[15:20:17] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (Jhancock.wm)
[15:20:51] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, decommission-hardware, Patch-For-Review: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (Jhancock.wm) Open→Resolved
[15:22:57] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2422 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:57] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2423 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:57] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2424 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2434 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2435 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2436 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2437 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2440 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2442 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:23:00] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2443 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:23:00] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2450 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:23:01] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2451 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:23:14] <claime>	 There's gonna be a bit more flooding and then I'll be done :D
[15:23:34] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw24[2345].*.codfw.wmnet,cluster=jobrunner
[15:23:45] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw24[2345].*.codfw.wmnet,cluster=videoscaler
[15:25:45] <wikibugs>	 (PS2) Jaime Nuche: docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900312
[15:25:47] <wikibugs>	 (PS3) Jaime Nuche: deployment_server: clean up older images using systemd timer [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678)
[15:25:49] <wikibugs>	 (PS1) Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - https://gerrit.wikimedia.org/r/900353
[15:25:52] <wikibugs>	 (CR) Nicolas Fraison: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[15:26:11] <wikibugs>	 SRE, sre-unowned: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (MoritzMuehlenhoff)
[15:26:17] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:23] <wikibugs>	 SRE, sre-unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (MoritzMuehlenhoff)
[15:27:12] <wikibugs>	 (CR) Jaime Nuche: docker::gc: update configuration to use latest version of images (2 comments) [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[15:27:53] <wikibugs>	 (PS1) Ayounsi: Test, add validator directory [software/netbox-extras] - https://gerrit.wikimedia.org/r/900354
[15:28:05] <wikibugs>	 (PS3) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323
[15:28:07] <wikibugs>	 (CR) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900323 (owner: Andrew Bogott)
[15:28:12] <sukhe>	 !log enable puppet on R:class = dnsrecursor to merge CR: 898957 [done]
[15:28:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:18] <wikibugs>	 (CR) Ayounsi: [C: +2] Test, add validator directory [software/netbox-extras] - https://gerrit.wikimedia.org/r/900354 (owner: Ayounsi)
[15:30:20] <wikibugs>	 (CR) Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/900313/40167/" [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[15:30:23] <wikibugs>	 (CR) Nicolas Fraison: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[15:32:03] <wikibugs>	 (CR) Jbond: Netbox: activate validators (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[15:32:03] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:17] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2426 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:17] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2427 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:17] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2428 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:17] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2429 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:17] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2430 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:18] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2444 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:18] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2445 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:19] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2446 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:33:18] <claime>	 sukhe: There, no more flooding :D
[15:34:14] <sukhe>	 claime: thanks! 
[15:36:56] <wikibugs>	 SRE, Traffic: Clean up and refactor the dnsrecursor module - https://phabricator.wikimedia.org/T332083 (ssingh) Open→Resolved a:ssingh This has been resolved with the https://gerrit.wikimedia.org/r/898957 and all `R:Class = dnsrecursor` hosts running bullseye:  ` sukhe@cumin2002:~$ sudo cumin...
[15:37:02] <wikibugs>	 (CR) Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/900312/40173/" [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[15:37:49] <wikibugs>	 (CR) Ayounsi: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:39:50] <claime>	 !log Pooled new mw hosts mw24[20-51].codfw.wmnet - T326363
[15:39:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:55] <stashbot>	 T326363: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363
[15:40:04] <wikibugs>	 (CR) JMeybohm: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[15:41:06] <wikibugs>	 (PS1) Cathal Mooney: Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080)
[15:41:45] <wikibugs>	 (CR) CI reject: [V: -1] Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[15:43:22] <wikibugs>	 (CR) Ayounsi: [C: +1] Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:43:57] <wikibugs>	 (CR) Ayounsi: Netbox: activate validators (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[15:44:21] <wikibugs>	 (Abandoned) Ayounsi: Netbox: activate validators [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[15:44:31] <wikibugs>	 (CR) David Caro: [C: +2] wmcs: update ceph alerts dashboard [alerts] - https://gerrit.wikimedia.org/r/895220 (owner: David Caro)
[15:44:34] <wikibugs>	 (PS3) David Caro: wmcs: update ceph alerts dashboard [alerts] - https://gerrit.wikimedia.org/r/895220
[15:46:47] <wikibugs>	 (PS2) Cathal Mooney: Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080)
[15:46:59] <wikibugs>	 SRE, serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (Clement_Goubert) All done ` {"mw2422.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"} {"mw2423.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "...
[15:47:06] <wikibugs>	 (PS10) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:47:25] <wikibugs>	 (CR) CI reject: [V: -1] Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[15:47:27] <wikibugs>	 (CR) Nicolas Fraison: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[15:47:38] <wikibugs>	 (Abandoned) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340 (owner: Jbond)
[15:48:07] <wikibugs>	 (CR) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/900340 (owner: Jbond)
[15:48:10] <wikibugs>	 (CR) Jaime Nuche: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/900353 (owner: Jaime Nuche)
[15:48:14] <wikibugs>	 (PS11) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:49:25] <wikibugs>	 (CR) Ayounsi: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:49:49] <wikibugs>	 (PS3) Cathal Mooney: Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080)
[15:50:26] <wikibugs>	 (CR) CI reject: [V: -1] Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[15:50:32] <wikibugs>	 (PS12) Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[15:50:34] <wikibugs>	 (PS5) Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859)
[15:51:35] <wikibugs>	 (CR) Ahmon Dancy: deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (owner: Jaime Nuche)
[15:51:41] <wikibugs>	 (CR) JMeybohm: "The update to 1.23 accidentally enabled the IPv6DualStack feature gate in kubelet even for clusters with "profile::kubernetes::ipv6dualsta" [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:53:11] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] deployment_server: clean up older images using systemd timer [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[15:53:25] <wikibugs>	 (PS4) Cathal Mooney: Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080)
[15:54:21] <wikibugs>	 (PS12) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:54:42] <wikibugs>	 (CR) Jaime Nuche: deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (owner: Jaime Nuche)
[15:55:10] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[15:55:45] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (owner: Jaime Nuche)
[15:56:38] <wikibugs>	 (CR) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:57:00] <wikibugs>	 (PS13) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:57:33] <wikibugs>	 (PS5) Jameel Kaisar: Create and deploy per-CDN-site DNS domains [dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025)
[15:59:38] <wikibugs>	 (PS2) Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622)
[16:00:04] <jouncebot>	 jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1600).
[16:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:02:53] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:15] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8702610, @Papaul wrote: > I know the first 4 ports on cloudsw1-b1 are set up as...
[16:04:53] <icinga-wm>	 RECOVERY - cinder-volume process on cloudcontrol1007 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[16:06:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:07:36] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025) (owner: Jameel Kaisar)
[16:09:54] <wikibugs>	 (PS14) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[16:11:38] <wikibugs>	 (PS1) Btullis: Upgrade Airflow on the platform_eng instance and switch to PostgreSQL [puppet] - https://gerrit.wikimedia.org/r/900366 (https://phabricator.wikimedia.org/T326193)
[16:11:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:13:01] <wikibugs>	 (PS15) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[16:13:52] <wikibugs>	 (PS1) EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245)
[16:14:16] <wikibugs>	 (PS1) Jbond: Revert "Test, add validator directory" [software/netbox-extras] - https://gerrit.wikimedia.org/r/900138
[16:14:19] <wikibugs>	 (CR) CI reject: [V: -1] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney)
[16:14:55] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] Revert "Test, add validator directory" [software/netbox-extras] - https://gerrit.wikimedia.org/r/900138 (owner: Jbond)
[16:15:22] <wikibugs>	 (PS2) EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245)
[16:16:20] <wikibugs>	 (PS1) Jbond: test validators with multiple files [software/netbox-extras] - https://gerrit.wikimedia.org/r/900371
[16:16:37] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] test validators with multiple files [software/netbox-extras] - https://gerrit.wikimedia.org/r/900371 (owner: Jbond)
[16:21:04] <wikibugs>	 (PS16) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[16:21:33] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40175/console" [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney)
[16:21:42] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (FNavas-foundation) Hi @MatthewVernon, thanks for picking this up - I do need Turnilo access. I also need access to a special dashboard created by @Pablo.  Sorry about...
[16:21:44] <wikibugs>	 (PS1) Jbond: Revert "test validators with multiple files" [software/netbox-extras] - https://gerrit.wikimedia.org/r/900139
[16:21:49] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] Revert "test validators with multiple files" [software/netbox-extras] - https://gerrit.wikimedia.org/r/900139 (owner: Jbond)
[16:22:40] <wikibugs>	 (PS1) EoghanGaffney: Adds php and apache logs for doc machines [puppet] - https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245)
[16:22:55] <wikibugs>	 (CR) Jbond: "ready for review: tested and lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[16:23:30] <wikibugs>	 (CR) Jbond: [C: +2] Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[16:24:53] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40176/console" [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[16:25:58] <wikibugs>	 (CR) JHathaway: [C: +2] exim: remove wikimedia.com from wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/898896 (https://phabricator.wikimedia.org/T331676) (owner: Dzahn)
[16:29:17] <wikibugs>	 (PS4) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323
[16:29:19] <wikibugs>	 (PS1) Andrew Bogott: Trove: increase volume formats a whole lot [puppet] - https://gerrit.wikimedia.org/r/900379
[16:30:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet
[16:30:59] <wikibugs>	 (CR) Jameel Kaisar: "Done" [dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025) (owner: Jameel Kaisar)
[16:31:11] <Emperor>	 !log reboot ms-be2067 again to see if the missing drive comes back
[16:31:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:31:57] <wikibugs>	 (CR) Andrew Bogott: [C: +2] rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323 (owner: Andrew Bogott)
[16:32:06] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Trove: increase volume formats a whole lot [puppet] - https://gerrit.wikimedia.org/r/900379 (owner: Andrew Bogott)
[16:32:10] <wikibugs>	 (PS1) Elukey: ml-services: limit deployments of experimental to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900381
[16:35:20] <wikibugs>	 (CR) CI reject: [V: -1] ml-services: limit deployments of experimental to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900381 (owner: Elukey)
[16:36:17] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:37:29] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:45:49] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:45:57] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (MatthewVernon) @Papaul that's the one - can you clear the Foreign state from that disk, please? I can't figure out how to do it, and I think without that config being cleared it's...
[16:46:03] <icinga-wm>	 PROBLEM - pybal on lvs4010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[16:46:15] <sukhe>	 ^ expected, rebooting
[16:46:23] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[16:46:33] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:46:55] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs4010.ulsfo.wmnet with reason: rebooting for kernel updates
[16:47:11] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs4010.ulsfo.wmnet with reason: rebooting for kernel updates
[16:47:13] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:50:02] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) cloudservices2004-dev = U37 cloudservices2005-dev = U38 cloudweb2002-dev = 39  yes we already o...
[16:53:03] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:53:47] <icinga-wm>	 RECOVERY - pybal on lvs4010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[16:54:07] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs4010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:54:17] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[16:56:05] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs4010.ulsfo.wmnet
[16:56:05] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs4010.ulsfo.wmnet
[16:56:32] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Aklapper) > I made another user with my WMF mediawiki account (FNavas-foundation), which is linked to my foundation email. On a related note, in staff capacity I'd rec...
[16:56:52] <wikibugs>	 (CR) Bking: [V: +1 C: +2] wdqs: export more jmx metrics to prometheus [puppet] - https://gerrit.wikimedia.org/r/898687 (https://phabricator.wikimedia.org/T331405) (owner: DCausse)
[16:56:53] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:57:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 218k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[16:58:46] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@e17ee96]: First deploy after Airflow 2.5.1 upgrade.
[16:59:11] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@e17ee96]: First deploy after Airflow 2.5.1 upgrade. (duration: 00m 24s)
[17:01:19] <wikibugs>	 (CR) Btullis: [C: +2] Upgrade Airflow on the platform_eng instance and switch to PostgreSQL [puppet] - https://gerrit.wikimedia.org/r/900366 (https://phabricator.wikimedia.org/T326193) (owner: Btullis)
[17:02:29] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:41] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:15:00 on lvs4008.ulsfo.wmnet with reason: rebooting for kernel updates
[17:05:56] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on lvs4008.ulsfo.wmnet with reason: rebooting for kernel updates
[17:06:23] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:09:45] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:10:11] <sukhe>	 expected
[17:10:25] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:11:34] <wikibugs>	 (PS1) Hnowlan: thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033)
[17:12:03] <wikibugs>	 (PS1) Cmjohnson: Adding ms-fe1013-4 and thanos-fe1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/898922 (https://phabricator.wikimedia.org/T326846)
[17:12:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[17:12:09] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:13:14] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding ms-fe1013-4 and thanos-fe1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/898922 (https://phabricator.wikimedia.org/T326846) (owner: Cmjohnson)
[17:17:23] <wikibugs>	 (CR) FNegri: [C: +2] [tbs.harbor] Clean up admin pwd management [puppet] - https://gerrit.wikimedia.org/r/899724 (https://phabricator.wikimedia.org/T316323) (owner: FNegri)
[17:19:40] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) Ok cool.  So I'd propose we take it like this:  **1. Move sretest2001 from port xe-0/0/1 to xe...
[17:19:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:21:37] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye
[17:21:50] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Patch-For-Review: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with...
[17:21:51] <icinga-wm>	 PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[17:22:06] <sukhe>	 ^ expected
[17:22:17] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[17:22:33] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[17:22:42] <wikibugs>	 (PS1) Sharvaniharan: Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900396
[17:22:49] <wikibugs>	 (CR) CI reject: [V: -1] Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900396 (owner: Sharvaniharan)
[17:26:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[17:27:39] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:27:41] <icinga-wm>	 PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[17:29:03] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:29:37] <icinga-wm>	 RECOVERY - pybal on lvs4008 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[17:29:45] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:30:03] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:30:49] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:30:57] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:31:42] <wikibugs>	 (PS1) Sharvaniharan: Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900399
[17:31:52] <wikibugs>	 (Abandoned) Sharvaniharan: Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900396 (owner: Sharvaniharan)
[17:32:51] <wikibugs>	 (CR) Ayounsi: [C: +2] "Awesome!" [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[17:32:53] <wikibugs>	 (PS1) FNegri: [tbs.harbor] Remove duplicate pwd [puppet] - https://gerrit.wikimedia.org/r/900400 (https://phabricator.wikimedia.org/T316323)
[17:33:27] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:23] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[17:35:01] <wikibugs>	 (Merged) jenkins-bot: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[17:35:04] <wikibugs>	 (PS1) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401
[17:36:22] <logmsgbot>	 !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:36:28] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed w...
[17:37:08] <wikibugs>	 (CR) CI reject: [V: -1] harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro)
[17:37:10] <wikibugs>	 (PS1) Btullis: Remove the overridden configuration for airflow-platform_eng [puppet] - https://gerrit.wikimedia.org/r/900403 (https://phabricator.wikimedia.org/T326193)
[17:38:41] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40177/console" [puppet] - https://gerrit.wikimedia.org/r/900403 (https://phabricator.wikimedia.org/T326193) (owner: Btullis)
[17:39:18] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Remove the overridden configuration for airflow-platform_eng [puppet] - https://gerrit.wikimedia.org/r/900403 (https://phabricator.wikimedia.org/T326193) (owner: Btullis)
[17:40:09] <wikibugs>	 (PS2) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401
[17:40:11] <logmsgbot>	 !log ayounsi@cumin2002 START - Cookbook sre.netbox.update-extras rolling update on A:netbox-canary
[17:40:12] <logmsgbot>	 !log ayounsi@cumin2002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox-canary
[17:40:46] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:40:52] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye
[17:41:15] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[17:41:17] <wikibugs>	 (PS3) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401
[17:41:29] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:25:00 on lvs4009.ulsfo.wmnet with reason: rebooting for kernel updates
[17:41:38] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on lvs4009.ulsfo.wmnet with reason: rebooting for kernel updates
[17:43:33] <wikibugs>	 (CR) David Caro: "This is currently deployed in toolsbeta successfully \o/" [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro)
[17:43:38] <wikibugs>	 (CR) David Caro: [V: +1] harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro)
[17:44:32] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (Cmjohnson)
[17:45:03] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:43] <wikibugs>	 (PS1) Jbond: sre.hosts.reboot-single: args.depool not args.pool [cookbooks] - https://gerrit.wikimedia.org/r/900405
[17:46:16] <wikibugs>	 (PS3) Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622)
[17:46:25] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:47:03] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:47:24] <sukhe>	 expected ^
[17:48:51] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:50:31] <wikibugs>	 (CR) Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/900353/40178/" [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[17:52:09] <wikibugs>	 (CR) Ayounsi: [C: +2] "I manually cherry-picked the netbox-extra patch on netbox-dev as well as checked that the symlink is still there." [puppet] - https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[17:52:34] <wikibugs>	 (CR) Sharvaniharan: "Hi! Please review when you get a chance :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/900399 (owner: Sharvaniharan)
[17:54:06] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] deployment_server: ensure Docker is installed [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[17:54:10] <wikibugs>	 (PS1) David Caro: k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408
[17:54:26] <wikibugs>	 (PS2) David Caro: k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408
[17:56:27] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:18] <wikibugs>	 (CR) Ayounsi: [C: +1] Modify netops Icinga checks to gracefully deal with SNMP timeout (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[18:01:49] <wikibugs>	 (PS1) EoghanGaffney: Add doc host apache/php-fpm logs to kafka [puppet] - https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245)
[18:03:34] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs4009.ulsfo.wmnet
[18:03:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs4009.ulsfo.wmnet
[18:05:30] <brennen>	 jouncebot: nowandnext
[18:05:30] <jouncebot>	 For the next 1 hour(s) and 54 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1800)
[18:05:30] <jouncebot>	 In 1 hour(s) and 54 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T2000)
[18:05:43] <wikibugs>	 (PS3) David Caro: k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408
[18:13:52] <wikibugs>	 (CR) FNegri: harbor: don't use https, use web proxies instead (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro)
[18:15:56] <wikibugs>	 (PS4) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401
[18:17:10] <wikibugs>	 (PS5) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401
[18:17:25] <wikibugs>	 (CR) David Caro: harbor: don't use https, use web proxies instead (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro)
[18:17:50] <wikibugs>	 (CR) FNegri: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro)
[18:18:06] <icinga-wm>	 RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:19:54] <wikibugs>	 (CR) David Caro: [C: +2] harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro)
[18:19:56] <icinga-wm>	 PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100%
[18:22:43] <sukhe>	 ^ known?
[18:22:46] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:24:18] <icinga-wm>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:26:04] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:32] <icinga-wm>	 RECOVERY - Host ms-be2067 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms
[18:27:13] <wikibugs>	 (CR) Dbrant: [C: +1] Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900399 (owner: Sharvaniharan)
[18:28:20] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: Cathal Mooney)
[18:28:52] <wikibugs>	 (Merged) jenkins-bot: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: Cathal Mooney)
[18:29:02] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (Papaul) Open→Resolved  Physical Disk 0:2:19  Online  19  7451.5 GB SATA  HDD  No
[18:29:23] <wikibugs>	 SRE, Domains, Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (BCornwall) Hi, @CRoslof! Have you been able to look into these registrations? Thanks!
[18:29:38] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:31:24] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sdv1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:18] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Modify netops Icinga checks to gracefully deal with SNMP timeout (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[18:37:04] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye
[18:37:10] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed w...
[18:38:25] <wikibugs>	 (CR) FNegri: [C: +1] k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408 (owner: David Caro)
[18:38:27] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2067.codfw.wmnet
[18:40:45] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@5c2c701]: (no justification provided)
[18:40:58] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@5c2c701]: (no justification provided) (duration: 00m 13s)
[18:41:28] <wikibugs>	 SRE, Traffic, VPS-project-Codesearch, serviceops, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall)
[18:44:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney) Open→Resolved Changes merged and feature now fully controlled from automation / homer.
[18:49:22] <wikibugs>	 SRE, ops-eqiad, SRE Observability (FY2022/2023-Q3): Decommission centrallog1001 - https://phabricator.wikimedia.org/T328803 (cmooney) @andrea.denisse just a heads up we got an alarm on our core routers in Eqiad for a BFD/BGP session down.  Seems this server was configured to BGP peer with the CRs? `...
[18:51:39] <wikibugs>	 (PS1) Cathal Mooney: Remove BGP peering to centrallog1001 in eqiad [homer/public] - https://gerrit.wikimedia.org/r/900422 (https://phabricator.wikimedia.org/T328803)
[18:52:27] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Remove BGP peering to centrallog1001 in eqiad [homer/public] - https://gerrit.wikimedia.org/r/900422 (https://phabricator.wikimedia.org/T328803) (owner: Cathal Mooney)
[18:53:04] <wikibugs>	 (Merged) jenkins-bot: Remove BGP peering to centrallog1001 in eqiad [homer/public] - https://gerrit.wikimedia.org/r/900422 (https://phabricator.wikimedia.org/T328803) (owner: Cathal Mooney)
[18:58:35] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: SNMP Network Checks throw exception when device is unreachable - https://phabricator.wikimedia.org/T332080 (cmooney) Open→Resolved Closing this one, all our checks now deal with the scenario gracefully.  The exa...
[18:59:20] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:00:23] <sharvani_>	 Hi :wave
[19:00:44] <RhinosF1>	 Can we help, sharvani_?
[19:00:46] <sharvani_>	 here for the "UTC morning backport and config training" window 
[19:00:59] <RhinosF1>	 jouncebot: now
[19:00:59] <jouncebot>	 For the next 0 hour(s) and 59 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1800)
[19:01:05] <RhinosF1>	 jouncebot: next
[19:01:05] <jouncebot>	 In 0 hour(s) and 58 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T2000)
[19:01:14] <RhinosF1>	 sharvani_: you are an hour early
[19:02:01] <RhinosF1>	 sharvani_: what’s your phab username?
[19:02:19] <sharvani_>	 Sharvaniharan
[19:03:13] <RhinosF1>	 sharvani_: one moment
[19:04:02] <RhinosF1>	 I don’t see you as signed up
[19:04:32] <wikibugs>	 (CR) Cwhite: [C: +2] varnish: ignore levelname field [puppet] - https://gerrit.wikimedia.org/r/891391 (https://phabricator.wikimedia.org/T330267) (owner: Cwhite)
[19:05:09] <RhinosF1>	 brennen, thcipriani: are you happy with sharvani_ doing the training?
[19:05:46] <sharvani_>	 I am not doing the training, just have a patch to be deployed.
[19:05:47] <RhinosF1>	 sharvani_: hang around, hopefully the trainers can confirm they are planning something
[19:06:03] <brennen>	 ah, if not attending the training, then yeah - wait for backport window.
[19:06:16] <RhinosF1>	 sharvani_: your patch isn’t scheduled
[19:06:37] <brennen>	 someone will likely be around to backport.  and yes, please add patch to schedule.
[19:08:43] <brennen>	 https://wikitech.wikimedia.org/wiki/Backport_windows#How_to_submit_a_patch_for_backport
[19:10:01] <sharvani_>	 Sorry had missed adding my ircnick to the request. Did that! thank you.
[19:10:47] <wikibugs>	 SRE, Observability-Logging, Traffic: varnish-frontend-fetcherr sets incorrect level in logstash - https://phabricator.wikimedia.org/T330267 (colewhite) Open→Resolved a:colewhite Fixes rolled out for SEVERITY_LABEL and levelName field.
[19:15:08] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:18:56] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (MoritzMuehlenhoff) @FNavas-foundation Who's your manager? They need to sign off on this task. @Ottomata This needs approval for analytics-privatedata-users
[19:19:40] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:20:13] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (MoritzMuehlenhoff)
[19:27:54] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@a587106]: (no justification provided)
[19:28:06] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@a587106]: (no justification provided) (duration: 00m 12s)
[19:31:56] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:09] <wikibugs>	 (PS1) Jforrester: Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" [vendor] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900427 (https://phabricator.wikimedia.org/T321160)
[19:32:31] <wikibugs>	 (PS1) Jforrester: Revert "build: Remove pinning of indirect lcobucci/jwt dependency" [extensions/OAuth] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900144 (https://phabricator.wikimedia.org/T321160)
[19:35:13] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (Htriedman) just bumping this!
[19:35:26] <wikibugs>	 (PS2) Jforrester: Revert "build: Remove pinning of indirect lcobucci/jwt dependency" [extensions/OAuth] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900144 (https://phabricator.wikimedia.org/T321160)
[19:41:12] <wikibugs>	 SRE, Traffic-Icebox: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (BCornwall) Open→Stalled @Vgutierrez, @BBlack: What's the status of this? It's three years old but I took a look at the transient memory and there are only a few instances where it...
[19:44:38] <wikibugs>	 SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (BCornwall) Open→Resolved
[19:49:07] <wikibugs>	 (PS1) Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306)
[19:49:56] <wikibugs>	 (PS2) Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306)
[19:50:54] <brennen>	 note for backport deployers:  currently working on a train blocker, i may need you to hold off.
[19:51:00] <wikibugs>	 (PS1) Cathal Mooney: Enable OSPF check by default for l3 switch mgmt interfaces [puppet] - https://gerrit.wikimedia.org/r/900431 (https://phabricator.wikimedia.org/T315053)
[19:51:08] <sharvani_>	 thank you for the update
[19:52:44] <wikibugs>	 (PS3) Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306)
[19:54:27] <brennen>	 i'm just waiting at present on https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/900427/
[19:56:10] <wikibugs>	 (CR) Brennen Bearnes: [C: +2] Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" [vendor] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900427 (https://phabricator.wikimedia.org/T321160) (owner: Jforrester)
[19:56:19] <brennen>	 (heh, would help to +2)
[19:59:01] <TheresNoTime>	 :p
[20:00:05] <jouncebot>	 brennen and TheresNoTime: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T2000).
[20:00:05] <jouncebot>	 sharvani_: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:09] * TheresNoTime can deploy, holding off deploy per brennen
[20:01:56] <brennen>	 thanks TheresNoTime, sharvani_.  i'll ping once this is safely out, you can do backports, and then we'll roll the train forward.
[20:02:09] <TheresNoTime>	 sure thing :)
[20:02:56] <sharvani_>	 Sure.. :)
[20:10:07] <wikibugs>	 (PS1) Andrew Bogott: Trove: adjust volume formats a whole lot [puppet] - https://gerrit.wikimedia.org/r/900436
[20:10:45] <wikibugs>	 (Merged) jenkins-bot: Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" [vendor] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900427 (https://phabricator.wikimedia.org/T321160) (owner: Jforrester)
[20:10:52] <wikibugs>	 (PS2) Andrew Bogott: Trove: adjust volume format timeouts again [puppet] - https://gerrit.wikimedia.org/r/900436
[20:12:43] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:900427|Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" (T321160)]]
[20:12:49] <stashbot>	 T321160: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty - https://phabricator.wikimedia.org/T321160
[20:13:15] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Trove: adjust volume format timeouts again [puppet] - https://gerrit.wikimedia.org/r/900436 (owner: Andrew Bogott)
[20:14:38] <logmsgbot>	 !log brennen@deploy2002 brennen and jforrester: Backport for [[gerrit:900427|Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" (T321160)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet
[20:18:59] <wikibugs>	 SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (FNavas-foundation) @MoritzMuehlenhoff my manager is @RBrounley_WMF (he is off on holiday at the moment).  @Aklapper  is there someone who can help me untangle this? per...
[20:21:49] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:900427|Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" (T321160)]] (duration: 09m 06s)
[20:21:55] <stashbot>	 T321160: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty - https://phabricator.wikimedia.org/T321160
[20:22:45] <brennen>	 ok, vendor patch deployed; nothing seems to be exploding, at least.
[20:23:06] <brennen>	 TheresNoTime: give me 2 min to see if bug is fixed, then it's all yours.
[20:23:11] <TheresNoTime>	 ack :)
[20:25:06] <brennen>	 TheresNoTime: all yours
[20:25:12] <TheresNoTime>	 great
[20:25:18] <TheresNoTime>	 sharvani_: you ready?
[20:25:21] <wikibugs>	 (PS1) JHathaway: jaeger: upgrade chart version [deployment-charts] - https://gerrit.wikimedia.org/r/900438 (https://phabricator.wikimedia.org/T320554)
[20:25:26] <sharvani_>	 ready! 
[20:25:30] <sharvani_>	 thank you :)
[20:25:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/900399 (owner: Sharvaniharan)
[20:26:18] <wikibugs>	 (Merged) jenkins-bot: Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900399 (owner: Sharvaniharan)
[20:26:40] <TheresNoTime>	 sharvani_: am I correct in saying this is something which won't need testing on the mwdebug servers?
[20:26:42] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:900399|Remove sampling from breadCrumbs schema]]
[20:26:59] <sharvani_>	 I can test it... is it deployed on 2002?
[20:27:16] <TheresNoTime>	 not yet, I'll let you know :)
[20:27:54] <sharvani_>	 ok ready to test whenever it is done :)
[20:28:30] <logmsgbot>	 !log samtar@deploy2002 samtar and sharvaniharan: Backport for [[gerrit:900399|Remove sampling from breadCrumbs schema]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[20:28:42] <TheresNoTime>	 sharvani_: that's live on the mwdebug servers for testing
[20:28:53] <sharvani_>	 tested and looks good!
[20:28:59] <TheresNoTime>	 great, syncing now
[20:29:09] <sharvani_>	 thank you for deploying for me!
[20:34:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:35:01] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:900399|Remove sampling from breadCrumbs schema]] (duration: 08m 18s)
[20:35:11] <TheresNoTime>	 and that's now live sharvani_ :)
[20:35:28] <sharvani_>	 Perfect! thank you!
[20:35:58] <TheresNoTime>	 !log close UTC late backport window
[20:36:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:49] <brennen>	 !log 1.40.0-wmf.27 train (T330205): blockers hopefully resolved, rolling to all wikis
[20:36:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:54] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205
[20:37:52] <wikibugs>	 (PS1) JHathaway: jaeger: Add network policy support [deployment-charts] - https://gerrit.wikimedia.org/r/900443 (https://phabricator.wikimedia.org/T320554)
[20:38:49] <wikibugs>	 (CR) JHathaway: [C: +2] jaeger: upgrade chart version [deployment-charts] - https://gerrit.wikimedia.org/r/900438 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway)
[20:39:40] <wikibugs>	 (PS1) TrainBranchBot: all wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/900444 (https://phabricator.wikimedia.org/T330205)
[20:39:42] <wikibugs>	 (CR) TrainBranchBot: [C: +2] all wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/900444 (https://phabricator.wikimedia.org/T330205) (owner: TrainBranchBot)
[20:39:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:40:26] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.40.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/900444 (https://phabricator.wikimedia.org/T330205) (owner: TrainBranchBot)
[20:47:49] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.27  refs T330205
[20:47:55] <stashbot>	 T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205
[20:48:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:49:21] <wikibugs>	 (CR) JHathaway: [C: +2] jaeger: Add network policy support [deployment-charts] - https://gerrit.wikimedia.org/r/900443 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway)
[20:52:10] <wikibugs>	 (CR) Ottomata: [C: +1] "We can probably do the same for analytics-research-admins" [puppet] - https://gerrit.wikimedia.org/r/899653 (https://phabricator.wikimedia.org/T331647) (owner: Muehlenhoff)
[20:53:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:57:34] <wikibugs>	 (CR) Ayounsi: [C: +1] Enable OSPF check by default for l3 switch mgmt interfaces [puppet] - https://gerrit.wikimedia.org/r/900431 (https://phabricator.wikimedia.org/T315053) (owner: Cathal Mooney)
[21:04:20] <wikibugs>	 (PS1) Ayounsi: Add cloudsw1-b1-codfw to Rancid [puppet] - https://gerrit.wikimedia.org/r/900448 (https://phabricator.wikimedia.org/T327919)
[21:09:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:14:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:17:11] <wikibugs>	 (CR) Dzahn: [C: +1] "https://puppet-compiler.wmflabs.org/output/900313/40180/" [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[21:19:04] <wikibugs>	 (CR) Dzahn: [C: -1] "The production deployment servers nowadays use role::deployment_server::kubernetes, but role::deployment_server is also still around, prob" [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[21:20:51] <wikibugs>	 (CR) Dzahn: [C: -1] "deployment_server::kubernetes (the one applied on deploy1002/2002) already has  include profile::docker::engine  but not the prune part.  " [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[21:21:39] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[21:21:43] <wikibugs>	 (CR) Dzahn: [C: +1] Adds php and apache logs for doc machines [puppet] - https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney)
[21:34:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add approvers for analytics-platform-eng-admins [puppet] - https://gerrit.wikimedia.org/r/899653 (https://phabricator.wikimedia.org/T331647) (owner: Muehlenhoff)
[21:44:18] <wikibugs>	 (PS1) JHathaway: aux: add network policy for jaeger [deployment-charts] - https://gerrit.wikimedia.org/r/900453 (https://phabricator.wikimedia.org/T320554)
[21:53:21] <wikibugs>	 (CR) JHathaway: [C: +2] aux: add network policy for jaeger [deployment-charts] - https://gerrit.wikimedia.org/r/900453 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway)
[21:54:50] <logmsgbot>	 !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[22:02:56] <wikibugs>	 (PS1) JHathaway: aux: bump jaeger version [deployment-charts] - https://gerrit.wikimedia.org/r/900457 (https://phabricator.wikimedia.org/T320554)
[22:04:59] <logmsgbot>	 !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[22:08:33] <wikibugs>	 (CR) JHathaway: [C: +2] aux: bump jaeger version [deployment-charts] - https://gerrit.wikimedia.org/r/900457 (https://phabricator.wikimedia.org/T320554) (owner: JHathaway)
[22:09:22] <logmsgbot>	 !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[22:09:35] <logmsgbot>	 !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[22:26:23] <wikibugs>	 (CR) Dzahn: [C: +1] Add doc host apache/php-fpm logs to kafka [puppet] - https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney)
[22:28:06] <wikibugs>	 (CR) Dzahn: [C: +1] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney)
[22:29:17] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host miscweb2003.codfw.wmnet
[22:29:19] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.netbox
[22:31:34] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM miscweb2003.codfw.wmnet - dzahn@cumin2002"
[22:32:40] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM miscweb2003.codfw.wmnet - dzahn@cumin2002"
[22:32:40] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:32:40] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache miscweb2003.codfw.wmnet on all recursors
[22:32:43] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) miscweb2003.codfw.wmnet on all recursors
[22:35:36] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host miscweb1003.eqiad.wmnet
[22:35:37] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[22:37:36] <wikibugs>	 (CR) Dzahn: [C: -1] deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[22:38:35] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM miscweb1003.eqiad.wmnet - dzahn@cumin1001"
[22:39:42] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM miscweb1003.eqiad.wmnet - dzahn@cumin1001"
[22:39:42] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:39:43] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache miscweb1003.eqiad.wmnet on all recursors
[22:39:46] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) miscweb1003.eqiad.wmnet on all recursors
[22:42:27] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host miscweb2003.codfw.wmnet
[22:42:30] <wikibugs>	 (CR) Dzahn: deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche)
[22:49:29] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host miscweb1003.eqiad.wmnet
[22:50:46] <wikibugs>	 (PS1) Dzahn: set target quarter for miscweb bullseye upgrade to 2023-1 [puppet] - https://gerrit.wikimedia.org/r/900463 (https://phabricator.wikimedia.org/T291916)
[22:53:09] <wikibugs>	 (PS1) Dzahn: remove role::webserver_misc_apps from sre module [puppet] - https://gerrit.wikimedia.org/r/900464
[22:54:21] <wikibugs>	 (PS1) Dzahn: miscweb: add miscweb1003/2003 to rsync_dst_hosts [puppet] - https://gerrit.wikimedia.org/r/900465 (https://phabricator.wikimedia.org/T331896)
[22:58:00] <wikibugs>	 (PS1) Dzahn: site: add miscweb1003/miscweb2003 with role insetup::serviceops_collab [puppet] - https://gerrit.wikimedia.org/r/900486 (https://phabricator.wikimedia.org/T331896)
[22:58:53] <wikibugs>	 (CR) Dzahn: [C: +2] site: add miscweb1003/miscweb2003 with role insetup::serviceops_collab [puppet] - https://gerrit.wikimedia.org/r/900486 (https://phabricator.wikimedia.org/T331896) (owner: Dzahn)
[23:00:31] <wikibugs>	 SRE: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543 (nshahquinn-wmf) Open→Resolved This has been done for a long time. See [wikitech:Data Engineering/Systems/Jupyter](https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Jupyter) for details.
[23:00:31] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.ganeti.reimage for host miscweb2003.codfw.wmnet with OS bullseye
[23:01:35] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.reimage for host miscweb1003.eqiad.wmnet with OS bullseye
[23:11:31] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on miscweb1003.eqiad.wmnet with reason: host reimage
[23:12:31] <wikibugs>	 (PS1) Jdlrobson: Make messages about editing site code more prominent [core] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900467 (https://phabricator.wikimedia.org/T311891)
[23:14:50] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on miscweb1003.eqiad.wmnet with reason: host reimage
[23:15:29] <logmsgbot>	 !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on miscweb2003.codfw.wmnet with reason: host reimage
[23:18:54] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on miscweb2003.codfw.wmnet with reason: host reimage
[23:20:08] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@e6f0142]: bump discolytics env to 0.7.0
[23:20:27] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@e6f0142]: bump discolytics env to 0.7.0 (duration: 00m 19s)
[23:28:41] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host miscweb1003.eqiad.wmnet with OS bullseye
[23:31:40] <logmsgbot>	 !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host miscweb2003.codfw.wmnet with OS bullseye
[23:32:56] <wikibugs>	 (CR) Dzahn: [C: +2] miscweb: add monitoring for annual.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/898994 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn)
[23:33:21] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:25:00 on lvs3007.esams.wmnet with reason: rebooting for kernel updates
[23:33:36] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on lvs3007.esams.wmnet with reason: rebooting for kernel updates
[23:34:08] <wikibugs>	 (PS2) Dzahn: miscweb: add monitoring for bienvenida.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/898995 (https://phabricator.wikimedia.org/T327976)
[23:35:57] <wikibugs>	 (PS2) Dzahn: remove role::webserver_misc_apps from sre module [puppet] - https://gerrit.wikimedia.org/r/900464
[23:36:15] <wikibugs>	 (CR) Dzahn: [C: +2] miscweb: add monitoring for bienvenida.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/898995 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn)
[23:36:47] <icinga-wm>	 PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:37:11] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:37:11] <sukhe>	 ^ expected
[23:38:23] <icinga-wm>	 RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:38:49] <icinga-wm>	 RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 477, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:40:58] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on lvs6003.drmrs.wmnet with reason: rebooting for kernel updates
[23:41:13] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on lvs6003.drmrs.wmnet with reason: rebooting for kernel updates
[23:45:55] <icinga-wm>	 PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:46:03] <sukhe>	 ^ expected
[23:46:28] <wikibugs>	 (Restored) Ssingh: icinga: add ssingh to cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/767530 (owner: Ssingh)
[23:47:44] <wikibugs>	 (Abandoned) Ssingh: icinga: add ssingh to cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/767530 (owner: Ssingh)
[23:47:51] <icinga-wm>	 RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:50:39] <wikibugs>	 (PS1) Dzahn: miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976)
[23:51:02] <wikibugs>	 (CR) CI reject: [V: -1] miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn)
[23:51:19] <wikibugs>	 (PS2) Dzahn: miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976)
[23:51:41] <wikibugs>	 (PS1) Ssingh: icinga: add ssingh to cgi.cfg [puppet] - https://gerrit.wikimedia.org/r/900509
[23:51:43] <wikibugs>	 (CR) CI reject: [V: -1] miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn)
[23:52:33] <wikibugs>	 (PS3) Dzahn: miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976)
[23:54:17] <wikibugs>	 (CR) Dzahn: [C: +2] miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) (owner: Dzahn)