[00:00:28] Even the long prefixes will be shorter than the entire domain name... I think
[00:00:34] we had ALL the .wiki domains at first
[00:00:40] en.wiki de.wiki .. and so on
[00:00:52] but that's still wikipedia centric :)
[00:01:35] TheresNoTime: the language code interwikis are project-context sensitive. [[en:foo]] on fr.wikisource.org routes to en.wikisource.org.
[00:02:18] Ahh
[00:02:24] wikipedia is the default though
[00:02:31] there's a weird thing in that on commons or meta the 'en' interwiki goes to wikipedia, but that's mostly weirdness
[00:03:03] Makes sense for commons, considering most interwiki links there are linking to the thing in the image, generally speaking.
[00:03:13] Well you do `:w:en:` or something right?
[00:03:19] I should know this
[00:03:36] Eh, every time I make an interwiki link I have to check and I still don't remember
[00:05:47] perryprog: haha, yea, and then you might think "is it really easier than if I could paste the full URL and link [ ] instead of [[]]?" :)
[00:06:06] noooo it has the little square thingy and a different visited link color
[00:06:11] if you do someone will fix it for you either way though :) and tell you how you should do it right
[00:06:39] Full URLs in log entries are the way /s
[00:07:43] I'm a firm believer that Enterprisey's diff-permalink.js script is possibly the best gadget ever written. (https://enwp.org/User:Enterprisey/diff-permalink.js)
[00:07:57] And that's only slight hyperbole.
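Editor's note: the project-context rule described at 00:01:35 to 00:02:31 can be sketched roughly as below. This is only an illustration of the behaviour as stated in the chat, not MediaWiki's actual interwiki map; the function name, the hub set and the hostname pattern are all assumptions made for the example.

```python
# Hypothetical sketch of how a language-code interwiki prefix could be
# resolved relative to the wiki it is used on. The names and tables here
# are illustrative assumptions, not MediaWiki's real configuration.

# Multilingual hubs where, per the chat, 'en' points at Wikipedia.
MULTILINGUAL_HUBS = {"commons.wikimedia.org", "meta.wikimedia.org"}

# Per the chat, Wikipedia is the default target project.
DEFAULT_PROJECT = "wikipedia"


def resolve_language_prefix(lang: str, current_host: str) -> str:
    """Return the host that [[lang:Title]] would point at from current_host."""
    if current_host in MULTILINGUAL_HUBS:
        return f"{lang}.{DEFAULT_PROJECT}.org"
    parts = current_host.split(".")
    # Assumed host shape: "<language>.<project>.org", e.g. fr.wikisource.org.
    if len(parts) == 3 and parts[2] == "org":
        return f"{lang}.{parts[1]}.org"
    # Anything else falls back to the default project.
    return f"{lang}.{DEFAULT_PROJECT}.org"


if __name__ == "__main__":
    for host in ("fr.wikisource.org", "commons.wikimedia.org", "de.wikipedia.org"):
        print(host, "->", resolve_language_prefix("en", host))
```

The explicit hub set is there because splitting commons.wikimedia.org on dots would otherwise yield a nonexistent en.wikimedia.org target.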
[00:08:41] Yet another of their scripts which should be in core [00:12:18] just paste the full URL into VisualEditor, it cleans it up for you and even interwikis if possible [00:13:36] (2017 WTE does the same) [00:15:02] what's visualeditor [00:15:15] (also I don't see that happening in either) [00:15:43] Oh I see, it does it sneakily [00:16:13] gah that'd be really useful if I used VE [00:16:32] Though I don't see it happening in 2017 WTE (which I do use) [00:17:21] Ohh it's just when you use the shortcut and with a different link text. I'll shut up now. [00:25:05] yeah, you need to use the link insertion dialog [00:31:32] (03PS1) 10Tim Starling: Unprovision the "swift" dashboard [puppet] - 10https://gerrit.wikimedia.org/r/899885 (https://phabricator.wikimedia.org/T328872) [00:36:26] (03PS5) 10Aaron Schulz: DNM: add per-action component-level profiling in statsd using excimer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) [00:49:20] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:34] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (10Papaul) Physical Disk 0:2:19 Foreign 19 7451.5 GB SATA HDD No [02:21:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:45:26] 10SRE, 10Domains, 10Traffic: Acquire enwp.org - https://phabricator.wikimedia.org/T332220 (10Legoktm) >>! In T332220#8700530, @violetwtf wrote: > I think this spans a bit outside of "literally everything" -- enwp.org is widely used by Wikipedia editors in Wikipedia-adjacent channels to refer to Wikipedia. I... [03:21:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [04:04:44] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service,ceph-mon@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:25:04] PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 3.758e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [04:56:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [05:09:58] (03PS1) 10Gergő Tisza: Leveling up: Backport recent changes [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900026 
(https://phabricator.wikimedia.org/T322387) [05:17:51] (03CR) 10Gergő Tisza: Leveling up: Backport recent changes (031 comment) [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900026 (https://phabricator.wikimedia.org/T322387) (owner: 10Gergő Tisza) [05:23:06] RECOVERY - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 8821 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [05:57:23] (03PS1) 10Marostegui: Revert "db1176: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/899602 [05:57:47] (03CR) 10Marostegui: [C: 03+2] Revert "db1176: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/899602 (owner: 10Marostegui) [05:58:45] (03PS1) 10Marostegui: Revert "mariadb: Promote db1106 to m5 master" [puppet] - 10https://gerrit.wikimedia.org/r/899603 [05:58:51] (03PS2) 10Marostegui: Revert "mariadb: Promote db1106 to m5 master" [puppet] - 10https://gerrit.wikimedia.org/r/899603 [05:59:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: m5 master switch T332155 [05:59:33] T332155: Switchover m5 master (db1106 -> db1176) - https://phabricator.wikimedia.org/T332155 [05:59:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: m5 master switch T332155 [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T0600) [06:00:05] kormat, marostegui, and Amir1: gettimeofday() says it's time for Primary database switchover. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T0600) [06:01:52] (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Promote db1106 to m5 master" [puppet] - 10https://gerrit.wikimedia.org/r/899603 (owner: 10Marostegui) [06:03:50] !log Failover m5 from db1106 to db1176 - T332155 [06:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:31] (03PS1) 10Marostegui: Revert "mariadb: Move db1106 to m5" [puppet] - 10https://gerrit.wikimedia.org/r/899604 [06:09:42] (03CR) 10CI reject: [V: 04-1] Revert "mariadb: Move db1106 to m5" [puppet] - 10https://gerrit.wikimedia.org/r/899604 (owner: 10Marostegui) [06:13:41] (03Abandoned) 10Marostegui: Revert "mariadb: Move db1106 to m5" [puppet] - 10https://gerrit.wikimedia.org/r/899604 (owner: 10Marostegui) [06:22:09] (03PS1) 10Marostegui: instances.yaml: Remove db1105 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/900051 (https://phabricator.wikimedia.org/T331874) [06:22:34] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove db1105 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/900051 (https://phabricator.wikimedia.org/T331874) (owner: 10Marostegui) [06:23:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1105 from dbctl T331874', diff saved to https://phabricator.wikimedia.org/P45883 and previous config saved to /var/cache/conftool/dbconfig/20230316-062307-root.json [06:23:13] T331874: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 [06:29:20] (03CR) 10Krinkle: "Thanks for restoring these! We had renamed these in the Prometheus clean up some months ago." 
[alerts] - 10https://gerrit.wikimedia.org/r/899506 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [06:59:02] (03PS1) 10Marostegui: mariadb: Move db1106 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/900086 (https://phabricator.wikimedia.org/T332270) [06:59:48] (03CR) 10Marostegui: [C: 03+2] mariadb: Move db1106 to s1 [puppet] - 10https://gerrit.wikimedia.org/r/900086 (https://phabricator.wikimedia.org/T332270) (owner: 10Marostegui) [07:00:05] Amir1, apergos, and jnuche: OwO what's this, a deployment window?? UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T0700). nyaa~ [07:00:05] tgr: A patch you scheduled for UTC morning backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:09] morning! there are no trainees signed up today, and one patch in the window. I note that it consists of multiple php files; bear in mind that scap does not guarantee the order that these files will be copied into place; are the changes in each file independent of each other, so that there will be no errors? tgr_ that question's for you. and I assume you will be self-deploying today? [07:01:42] (03PS1) 10Marostegui: db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/900090 (https://phabricator.wikimedia.org/T332270) [07:02:29] (03CR) 10Marostegui: [C: 03+2] db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/900090 (https://phabricator.wikimedia.org/T332270) (owner: 10Marostegui) [07:04:57] apergos: yeah. The code is only executed on group2 so should be fine. 
[07:05:36] it's not a matter of which group of wikis but rather any interdependency among the patched phph modules (if there is any)
[07:05:46] *php
[07:06:20] well, if the code won't run, the interdependencies won't matter
[07:06:31] and .27 is not on group2 yet
[07:07:25] ah so the branch isn't out to group2, got it
[07:07:29] (CR) TrainBranchBot: [C: +2] "Approved by tgr@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900026 (https://phabricator.wikimedia.org/T322387) (owner: Gergő Tisza)
[07:25:56] (Merged) jenkins-bot: Leveling up: Backport recent changes [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900026 (https://phabricator.wikimedia.org/T322387) (owner: Gergő Tisza)
[07:26:25] !log tgr@deploy2002 Started scap: Backport for [[gerrit:900026|Leveling up: Backport recent changes]]
[07:28:05] !log tgr@deploy2002 tgr: Backport for [[gerrit:900026|Leveling up: Backport recent changes]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
[07:31:25] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) I am removing the DBA tag from this task as there are no more databases pending here. I will remain subscribed in case I am needed.
[07:31:44] (CR) Elukey: [C: +1] "LGTM! Could you pcc just to be sure?" [puppet] - https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[07:31:45] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) Oh, nevermind, db1131 is still to be moved.
[07:32:04] (CR) Elukey: [C: +1] k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[07:33:52] looks like the mediawiki-errors dashboard's chart isn't working
[07:33:55] not ideal for deployments
[07:34:05] not too good is it
[07:34:38] !log tgr@deploy2002 Finished scap: Backport for [[gerrit:900026|Leveling up: Backport recent changes]] (duration: 08m 13s)
[07:34:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:36:01] !log UTC morning deploys done
[07:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:13] ah it's all good?
[07:36:25] I was just going to ask how testing was going on the production cluster
[07:36:55] welp, never mind then, and see everyone next time!
[07:36:58] Well, I hope. I can't see the error trends. But in theory this deploy should have been a no-op.
[07:37:13] (Also not testable)
[07:37:29] I looked at the logs from mwlog2002 and it's fine
[07:37:43] cool, thanks
[07:37:54] I can keep that tab open for a few more minutes if there's a concern
[07:38:04] please report the issue with the one dashboard, that's concerning
[07:39:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:41:23] (CR) Elukey: [C: +1] k8s: Remove 1.16 related code [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[07:41:38] (Abandoned) Elukey: services: add staging config for Lift Wing to the API gateway [deployment-charts] - https://gerrit.wikimedia.org/r/898741 (owner: Elukey)
[07:44:26] filed: https://phabricator.wikimedia.org/T332273
[07:45:02] subscribed, thanks
[07:45:24] still nothing new in the logs so I'm calling it good
[07:55:13] (CR) Ayounsi: [C: +1] Move cloudsw prefix-list filters from templates to YAML [homer/public] - https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: Cathal Mooney)
[07:57:22] SRE, Release-Engineering-Team, Wikimedia-Logstash: mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (kostajh)
[07:59:48] apergos: can I backport two more patches, or should I leave it for later?
[08:00:11] (PS1) Kosta Harlan: Leveling up: always set wgGELevelingUpEnabledForUser [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/899605 (https://phabricator.wikimedia.org/T332227)
[08:00:27] window is over, it's already past the hour, kostajh
[08:00:29] (PS1) Kosta Harlan: SuggestedEditSession: Fix handling of post-save data refresh [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900126
[08:00:49] so next window now (daylight confusion time might have got ya)
[08:01:13] I know it's just ending/ended, but if there's not anything next it might be ok to just do now
[08:02:32] I don't know, the citoid window starts now, you'd need to talk to them
[08:03:38] it says it's in 2 hours? https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1000
[08:04:13] https://wikitech.wikimedia.org/wiki/Deployments#Thursday,_March_16 oh I see, I'm mixing up the timezones
[08:04:20] well what do you have? are you a self deployer?
[08:04:30] (sorry I always have to ask, I never remember...)
[08:04:31] yes, I can deploy myself
[08:05:18] and which is the patch?
[08:06:12] apergos: just added to imedia.org/wiki/Deployments#deploycal-item-20230316T0700
[08:06:29] SRE, Release-Engineering-Team, SRE Observability, Wikimedia-Logstash: mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (MoritzMuehlenhoff)
[08:06:54] how long are these going to take to merge?
[08:07:01] ~20 minutes I imagine
[08:07:28] ok go ahead (but next time try to get here for the window ;-) )
[08:07:40] thx
[08:08:11] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to deployment for ItamarWMDE - https://phabricator.wikimedia.org/T331899 (MoritzMuehlenhoff)
[08:08:59] (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900126 (owner: Kosta Harlan)
[08:09:06] (CR) TrainBranchBot: [C: +2] "Approved by kharlan@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/899605 (https://phabricator.wikimedia.org/T332227) (owner: Kosta Harlan)
[08:10:12] (PS1) Muehlenhoff: Add SSH key and Kerberos principal for ptiwary [puppet] - https://gerrit.wikimedia.org/r/900122 (https://phabricator.wikimedia.org/T332214)
[08:11:50] !log additional deployments for the UTC morning backport and config training window, running into the next hour, so window re-opened
[08:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:59] (CR) Muehlenhoff: [C: +2] Add SSH key and Kerberos principal for ptiwary [puppet] - https://gerrit.wikimedia.org/r/900122 (https://phabricator.wikimedia.org/T332214) (owner: Muehlenhoff)
[08:13:03] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ssh for jupyter notebooks for Prabhat - https://phabricator.wikimedia.org/T332214 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @prabhat I've enabled your access, but it will take up to 30 minutes until the change ha...
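Editor's note: the window mix-up between 08:00 and 08:04 comes down to reading the UTC time embedded in the calendar anchor (for example deploycal-item-20230316T1000) as if it were local time. A minimal sketch of checking a window against the current time follows; the function names are hypothetical, and the one-hour default length is an assumption inferred from the chat rather than a documented rule.

```python
# Hypothetical helper: interpret a deploy calendar anchor as a UTC start time
# and report whether the window is upcoming, open, or over.
from datetime import datetime, timedelta, timezone


def window_start(anchor: str) -> datetime:
    """Parse e.g. 'deploycal-item-20230316T1000' into an aware UTC datetime."""
    stamp = anchor.rsplit("-", 1)[-1]  # '20230316T1000'
    return datetime.strptime(stamp, "%Y%m%dT%H%M").replace(tzinfo=timezone.utc)


def window_status(anchor: str, now: datetime,
                  length: timedelta = timedelta(hours=1)) -> str:
    """Classify 'now' (must be timezone-aware) relative to the window.

    The one-hour default length is an assumption made for this sketch.
    """
    start = window_start(anchor)
    now = now.astimezone(timezone.utc)  # normalise any local zone to UTC
    if now < start:
        return f"starts in {start - now}"
    if now < start + length:
        return "open"
    return "over"


if __name__ == "__main__":
    # The moment of the "it says it's in 2 hours?" message, expressed in UTC.
    asked_at = datetime(2023, 3, 16, 8, 3, 38, tzinfo=timezone.utc)
    print(window_status("deploycal-item-20230316T0700", asked_at))
    print(window_status("deploycal-item-20230316T1000", asked_at))
```

Passing a timezone-aware local time gives the same answer, since the comparison is always done in UTC.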
[08:27:19] (Merged) jenkins-bot: SuggestedEditSession: Fix handling of post-save data refresh [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/900126 (owner: Kosta Harlan)
[08:27:25] (Merged) jenkins-bot: Leveling up: always set wgGELevelingUpEnabledForUser [extensions/GrowthExperiments] (wmf/1.40.0-wmf.27) - https://gerrit.wikimedia.org/r/899605 (https://phabricator.wikimedia.org/T332227) (owner: Kosta Harlan)
[08:27:38] apergos: verifying the patch on testwiki, then I'll sync
[08:27:42] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:900126|SuggestedEditSession: Fix handling of post-save data refresh]], [[gerrit:899605|Leveling up: always set wgGELevelingUpEnabledForUser (T332227)]]
[08:27:48] T332227: Leveling up: Try new task panel does not display for edits made with source editor - https://phabricator.wikimedia.org/T332227
[08:29:15] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:900126|SuggestedEditSession: Fix handling of post-save data refresh]], [[gerrit:899605|Leveling up: always set wgGELevelingUpEnabledForUser (T332227)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet
[08:30:56] I've got the logs pulled up and ready
[08:33:11] apergos: lgtm from the user side
[08:34:00] a disclaimer, I have the production logs pulled up, sorry, I should have been clearer
[08:34:11] mwdebug-logs look ok
[08:34:13] I'm syncing
[08:34:20] okey dokey
[08:37:01] (PS1) Kosta Harlan: GrowthExperiments: Enable LevelingUp features on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/900196 (https://phabricator.wikimedia.org/T317813)
[08:37:27] apergos: can I also sneak ^ in? then I'm done, really 😅
[08:37:36] uh
[08:37:42] there's only 20 minutes left
[08:37:59] that one is just a config patch, IIRC that takes ~2-5 minutes
[08:38:13] but maybe it is a longer process depending on docker image building.
I can leave it for later if you prefer.
[08:38:19] ok (but if it doesn't, I get to whine about it in your ear mercilessly for a while :-P )
[08:38:33] I'd rather you leave it for next time
[08:38:39] yeah I'll leave it
[08:38:39] there's lots of second chances:-)
[08:39:53] ack
[08:40:13] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:900126|SuggestedEditSession: Fix handling of post-save data refresh]], [[gerrit:899605|Leveling up: always set wgGELevelingUpEnabledForUser (T332227)]] (duration: 12m 30s)
[08:40:18] T332227: Leveling up: Try new task panel does not display for edits made with source editor - https://phabricator.wikimedia.org/T332227
[08:40:48] !log UTC morning deploys (second round) done
[08:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:03] thanks apergos, 'til next time...
[08:42:37] see ya!
[08:43:45] (PS1) Marostegui: mariadb: Decommission db1105 [puppet] - https://gerrit.wikimedia.org/r/900197 (https://phabricator.wikimedia.org/T331874)
[08:43:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1105.eqiad.wmnet
[08:48:05] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[08:48:45] (CR) Marostegui: [C: +2] mariadb: Decommission db1105 [puppet] - https://gerrit.wikimedia.org/r/900197 (https://phabricator.wikimedia.org/T331874) (owner: Marostegui)
[08:49:05] SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (dcausse) Hi everyone and sorry to jump into this conversion but just wanted t...
[08:49:55] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1105.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [08:50:22] 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (10Marostegui) a:05Marostegui→03None [08:51:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db1105.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - marostegui@cumin1001" [08:51:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:51:03] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1105.eqiad.wmnet [08:51:10] 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by marostegui@cumin1001 for hosts: `db1105.eqiad.wmnet` - db1105.eqiad.wmnet (**WARN**) - Downtimed... 
[08:51:14] 10ops-eqiad, 10DBA, 10decommission-hardware, 10Patch-For-Review: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (10Marostegui) This is ready for DC-Ops [08:51:27] 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review: decommission db1105.eqiad.wmnet - https://phabricator.wikimedia.org/T331874 (10Marostegui) [09:00:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 234k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [09:06:31] (03CR) 10Filippo Giunchedi: [C: 03+2] perf: fix webperf metric names (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/899506 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [09:20:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [09:26:29] 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Marostegui) [09:42:28] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: / (spec from root) timed out before a response was received: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:43:30] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 
(10fgiunchedi) [09:44:30] (03CR) 10Ayounsi: Management routers: move ssh port to 2222 (032 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [09:46:06] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:55:11] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Find a sensible way to redirect traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (10Joe) After some thought, I think the most maintainable way to do this is to add an additional lua module to the maps for api/appservers. Specifically, this would... [10:00:05] mvolz: May I have your attention please! Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1000) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1000) [10:03:57] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi) [10:09:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1179 to move it to x1', diff saved to https://phabricator.wikimedia.org/P45885 and previous config saved to /var/cache/conftool/dbconfig/20230316-100945-root.json [10:09:48] 10SRE, 10MW-on-K8s, 10Traffic, 10Wikidata, and 2 others: Migrate testwikidata to Kubernetes - https://phabricator.wikimedia.org/T331268 (10Clement_Goubert) [10:10:00] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 3 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [10:10:32] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM! In theory the cloudlb (and cloudgw), would not need the static route, as they will learn it through BGP from the cloudsw. 
The clou" [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [10:12:32] 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Datacenter-Switchover: Cookbooks that do DNS discovery change should check recdns - https://phabricator.wikimedia.org/T332009 (10Joe) I am not sure what would be the goal of checking the dns recursors in that situation, as far as running the cookbo... [10:15:04] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) >>! In T330693#8698639, @MatthewVernon wrote: > This is a k8s applic... [10:15:30] (03CR) 10Effie Mouzeli: [C: 03+1] Assign mediawiki roles to mw2420-mw2451 [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: 10Clément Goubert) [10:25:00] 10SRE-swift-storage, 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (10gmodena) Hi Eric, > I know; I didn't mean for this to come across as an indi... [10:25:50] 10SRE, 10Infrastructure-Foundations, 10netops: Invesitgate requirement for 'session-mode auatomatic' on EVPN iBGP peerings - https://phabricator.wikimedia.org/T332295 (10cmooney) p:05Triage→03Low [10:26:27] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [10:28:52] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:29:56] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:30:14] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [10:31:12] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:31:18] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [10:31:48] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:31:53] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [10:31:57] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw [10:32:13] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw [10:32:41] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: new_install [10:32:54] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[10:33:03] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: new_install [10:33:12] 10SRE, 10serviceops, 10Patch-For-Review: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c5ba1cf2-f027-43f9-8672-b4eb30f98ddc) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services w... [10:33:15] (03CR) 10Clément Goubert: [C: 03+2] Assign mediawiki roles to mw2420-mw2451 [puppet] - 10https://gerrit.wikimedia.org/r/896063 (https://phabricator.wikimedia.org/T326363) (owner: 10Clément Goubert) [10:33:38] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [10:33:47] (03PS1) 10Filippo Giunchedi: DNM: test alertmanager depool for prometheus1006 [puppet] - 10https://gerrit.wikimedia.org/r/900238 (https://phabricator.wikimedia.org/T331449) [10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:36:34] (03PS1) 10Ilias Sarantopoulos: ml-services: allow scale-to-zero for staging deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/900239 (https://phabricator.wikimedia.org/T325763) [10:37:20] (03PS2) 10Ilias Sarantopoulos: ml-services: allow scale-to-zero for staging deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/900239 
(https://phabricator.wikimedia.org/T325763) [10:37:45] (03PS1) 10Elukey: admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240 [10:37:52] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin [10:38:04] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin [10:38:54] (03PS2) 10Elukey: admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240 [10:39:13] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:39:52] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:40:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:40:37] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [10:40:56] (03CR) 10Cathal Mooney: [C: 03+2] Move cloudsw prefix-list filters from templates to YAML [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [10:42:03] !log elukey@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [10:42:19] !log elukey@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[10:44:13] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:44:44] (03CR) 10CI reject: [V: 04-1] admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240 (owner: 10Elukey) [10:45:52] (03CR) 10Elukey: [C: 03+2] ml-services: allow scale-to-zero for staging deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/900239 (https://phabricator.wikimedia.org/T325763) (owner: 10Ilias Sarantopoulos) [10:46:09] I might have broken Jenkins somehow [10:46:16] (03Merged) 10jenkins-bot: Move cloudsw prefix-list filters from templates to YAML [homer/public] - 10https://gerrit.wikimedia.org/r/898684 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [10:47:46] well maybe not :] [10:48:18] (03PS6) 10Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) [10:48:20] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01271 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:48:25] (03CR) 10CI reject: [V: 04-1] Add automation for EVPN BGP peerings [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [10:49:26] (03PS4) 10Jameel Kaisar: Create and deploy per-CDN-site DNS domains [dns] - 10https://gerrit.wikimedia.org/r/899214 [10:49:36] PROBLEM - Check systemd state on puppetdb2002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_stockpile_queue.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:50:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of 
HAProxy on A:cp-upload_codfw [10:51:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 204.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [10:51:28] RECOVERY - Check systemd state on puppetdb2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:27] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw [10:55:00] (03PS1) 10Filippo Giunchedi: traffic: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/900241 (https://phabricator.wikimedia.org/T309182) [10:55:51] (03PS1) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) [10:56:21] (03PS7) 10Cathal Mooney: Add automation for EVPN BGP peerings [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) [10:56:31] (03PS10) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [10:56:33] (03PS3) 10Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) [10:56:45] (03CR) 10Cathal Mooney: Add automation for EVPN BGP peerings (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: 10Cathal Mooney) [10:57:31] (03CR) 10Filippo Giunchedi: "Note that this might uncover some problems with the alerts once 
deployed, I'll followup with further patches to fix the alerts" [alerts] - 10https://gerrit.wikimedia.org/r/900241 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [10:58:13] (03CR) 10CI reject: [V: 04-1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond) [10:59:02] (03PS2) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) [11:00:18] (03PS11) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) [11:00:20] (03PS4) 10Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) [11:01:07] (03CR) 10CI reject: [V: 04-1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond) [11:01:09] (03PS3) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) [11:03:20] (03CR) 10CI reject: [V: 04-1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond) [11:03:22] (03PS4) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) [11:04:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=4; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:04:23] 
(03CR) 10Vgutierrez: [C: 03+1] traffic: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/900241 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [11:05:46] (03CR) 10CI reject: [V: 04-1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond) [11:06:01] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs [11:06:07] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs [11:06:37] jclark-ctr: cables order for ya will ship from California on 4/11 [11:07:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin [11:07:48] robh: you just messed my internal clock by chatting this early in the day :P [11:08:25] (03CR) 10Cathal Mooney: [C: 03+1] cloudlb: introduce cloud-private IP address [puppet] - 10https://gerrit.wikimedia.org/r/899569 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:09:01] (03CR) 10JMeybohm: "Giuseppe added some functionality to check/verify the state of discovery DNS, maybe that could be of use here as well:" [cookbooks] - 10https://gerrit.wikimedia.org/r/849130 (owner: 10Jbond) [11:09:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: add route to other similar subnets [puppet] - 10https://gerrit.wikimedia.org/r/899616 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:10:17] !log hnowlan@puppetmaster1001 conftool action : set/weight=2; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:10:33] (03CR) 10Filippo Giunchedi: [C: 03+2] traffic: deploy alerts to 'ops' instance [alerts] - 10https://gerrit.wikimedia.org/r/900241 (https://phabricator.wikimedia.org/T309182) 
(owner: 10Filippo Giunchedi) [11:11:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [11:16:33] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on 32 hosts with reason: new_install [11:16:56] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 32 hosts with reason: new_install [11:17:02] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=17f33514-0b87-4f50-abfa-6cd2e1548410) set by cgoubert@cumin1001 for 5:00:00 on 32 host(s) and their services with reason: new_instal... [11:20:00] (03PS5) 10Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) [11:23:21] 10SRE-swift-storage, 10MediaWiki-File-management: FileBackendMultiWrite multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (10kostajh) >>! In T331138#8698560, @Xover wrote: >>>! In T331138#8676812, @thcipriani wrote: >> I checked [[ https://www.mediawiki.org/wiki/Developers/Maintainers... 
[11:24:23] bleh wrong channel cuz its too early heh [11:24:27] (03PS1) 10Clément Goubert: scap: Fix bootstrap-scap-target.sh exec [puppet] - 10https://gerrit.wikimedia.org/r/900252 [11:27:12] !log hnowlan@puppetmaster1001 conftool action : set/weight=3; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:29:07] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs [11:29:26] (03CR) 10Elukey: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240 (owner: 10Elukey) [11:30:40] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) >>! In T327919#8699523, @cmooney wrote: > > In terms of the move we need to work with @aborr... [11:30:55] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs [11:32:06] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams [11:32:10] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin [11:32:16] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams [11:36:08] (03CR) 10Clément Goubert: [C: 03+2] scap: Fix bootstrap-scap-target.sh exec [puppet] - 10https://gerrit.wikimedia.org/r/900252 (owner: 10Clément Goubert) [11:36:14] (03CR) 10Jbond: sre.{ganeti,hosts}.reimage: Confirm with hostname (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall) [11:37:27] !log hnowlan@puppetmaster1001 conftool action : set/weight=4; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:39:03] (03PS1) 
10Arturo Borrero Gonzalez: cloud_private_subnet: don't use CIDR for gateway parameter in static route [puppet] - 10https://gerrit.wikimedia.org/r/900262 (https://phabricator.wikimedia.org/T324992) [11:41:00] (03CR) 10CI reject: [V: 04-1] cloud_private_subnet: don't use CIDR for gateway parameter in static route [puppet] - 10https://gerrit.wikimedia.org/r/900262 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:42:16] (03PS1) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/homer/public [homer/public] - 10https://gerrit.wikimedia.org/r/900263 [11:42:18] (03PS1) 10Cathal Mooney: Add protocol direct to Cloud_outfilter protocols [homer/public] - 10https://gerrit.wikimedia.org/r/900264 (https://phabricator.wikimedia.org/T327919) [11:43:10] (03Abandoned) 10Cathal Mooney: Merge branch 'master' of ssh://gerrit.wikimedia.org:29418/operations/homer/public [homer/public] - 10https://gerrit.wikimedia.org/r/900263 (owner: 10Cathal Mooney) [11:43:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet [11:43:29] (03CR) 10Cathal Mooney: [C: 03+2] Add protocol direct to Cloud_outfilter protocols [homer/public] - 10https://gerrit.wikimedia.org/r/900264 (https://phabricator.wikimedia.org/T327919) (owner: 10Cathal Mooney) [11:47:43] (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: don't use CIDR for gateway parameter in static route [puppet] - 10https://gerrit.wikimedia.org/r/900262 (https://phabricator.wikimedia.org/T324992) [11:52:03] (03CR) 10Volans: Management routers: move ssh port to 2222 (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/890811 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [11:54:19] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams [11:54:47] RECOVERY - Widespread puppet agent failures 
on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002937 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:56:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams [11:56:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud_private_subnet: don't use CIDR for gateway parameter in static route [puppet] - 10https://gerrit.wikimedia.org/r/900262 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [11:56:37] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad [11:56:45] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad [11:57:45] (03PS2) 10Cathal Mooney: Add protocol direct to Cloud_outfilter protocols [homer/public] - 10https://gerrit.wikimedia.org/r/900264 (https://phabricator.wikimedia.org/T327919) [11:58:39] (03CR) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [12:04:25] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:04:53] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:05:54] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: new_install [12:08:44] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: new_install [12:08:50] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f7f64d19-c64a-4fb5-a8ab-f3218dfd9862) set by 
cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_instal... [12:09:26] (03PS1) 10Slyngshede: Password reset - Allow users to request a password reset. [software/bitu] - 10https://gerrit.wikimedia.org/r/900277 [12:12:03] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 13 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:14:17] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad [12:14:25] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:16:20] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad [12:24:21] (03CR) 10Jbond: [C: 03+1] "lgtm just a typo and a minor optional nit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [12:25:11] (03CR) 10Nicolas Fraison: spark-operator: enable spark operator mutation webhook (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [12:30:56] (03PS1) 10Slyngshede: get_single_object - get modified timestamp [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/900304 [12:31:15] (03CR) 10CI reject: [V: 04-1] get_single_object - get modified timestamp [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/900304 (owner: 10Slyngshede) [12:32:08] (03PS2) 10Slyngshede: get_single_object - get modified timestamp [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/900304 [12:34:19] (03PS13) 10KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [12:38:06] (03CR) 10Ssingh: sre.{ganeti,hosts}.reimage: 
Confirm with hostname (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/899772 (https://phabricator.wikimedia.org/T332202) (owner: 10BCornwall) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1300) [13:00:05] xSavitar and raynor: A patch you scheduled for Mobileapps/RESTBase/Wikifeeds is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time is upon us again. Time for a UTC afternoon backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1300). [13:00:05] kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:20] hello [13:00:37] (I can deploy in 5m) [13:00:57] I don't mind deploying myself [13:01:04] (03PS5) 10JMeybohm: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) [13:01:06] (03PS3) 10JMeybohm: k8s: Remove 1.16 related code [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) [13:01:12] kostajh: go ahead! 
:) [13:02:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900196 (https://phabricator.wikimedia.org/T317813) (owner: 10Kosta Harlan) [13:02:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 (owner: 10Kosta Harlan) [13:02:54] (03PS2) 10Kosta Harlan: GrowthExperiments: Remove unused GENewImpactD3Enabled flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 [13:03:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900196 (https://phabricator.wikimedia.org/T317813) (owner: 10Kosta Harlan) [13:03:04] (03CR) 10TrainBranchBot: "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 (owner: 10Kosta Harlan) [13:03:58] TheresNoTime: kostajh If you get issues with mw24[20-52] that's my fault, they're being commissioned. They shouldn't cause issues because they're not pooled but I'd rather give you a heads up. 
[13:03:59] (03Merged) 10jenkins-bot: GrowthExperiments: Enable LevelingUp features on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900196 (https://phabricator.wikimedia.org/T317813) (owner: 10Kosta Harlan) [13:04:41] claime: thanks for letting me know [13:05:39] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:900196|GrowthExperiments: Enable LevelingUp features on testwiki (T317813)]] [13:05:45] T317813: [EPIC] Positive Reinforcement: Leveling Up - https://phabricator.wikimedia.org/T317813 [13:07:14] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:900196|GrowthExperiments: Enable LevelingUp features on testwiki (T317813)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [13:12:20] (03CR) 10JMeybohm: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [13:14:45] (03PS6) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) [13:14:48] (03PS6) 10JMeybohm: k8s: Make profile::kubernetes::pki::intermediate mandatory [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) [13:14:49] (03PS4) 10JMeybohm: k8s: Remove 1.16 related code [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) [13:15:28] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:900196|GrowthExperiments: Enable LevelingUp features on testwiki (T317813)]] (duration: 09m 48s) [13:15:33] T317813: [EPIC] Positive Reinforcement: Leveling Up - https://phabricator.wikimedia.org/T317813 [13:15:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:23] (03Abandoned) 10JMeybohm: k8s: Force to be explicit about k8s and calico versions [puppet] - 10https://gerrit.wikimedia.org/r/899661 (https://phabricator.wikimedia.org/T328291) (owner: 10Jbond) [13:19:48] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 16 DIFF 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40163/console" [puppet] - 10https://gerrit.wikimedia.org/r/899641 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [13:20:24] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 6 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40164/console" [puppet] - 10https://gerrit.wikimedia.org/r/899642 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [13:20:34] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40165/console" [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm) [13:20:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:24] (03PS3) 10Kosta Harlan: GrowthExperiments: Remove unused GENewImpactD3Enabled flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 [13:22:34] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 (owner: 10Kosta Harlan) [13:23:30] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 9 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi) [13:23:44] (03CR) 10Filippo Giunchedi: [C: 03+1] wmflib::service::probe::module_options: add tests 
[puppet] - 10https://gerrit.wikimedia.org/r/899643 (owner: 10Jbond) [13:23:50] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/899643 (owner: 10Jbond) [13:26:34] (03CR) 10JMeybohm: spark-operator: enable spark operator mutation webhook (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [13:26:36] (03Merged) 10jenkins-bot: GrowthExperiments: Remove unused GENewImpactD3Enabled flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/894593 (owner: 10Kosta Harlan) [13:27:01] !log kharlan@deploy2002 Started scap: Backport for [[gerrit:894593|GrowthExperiments: Remove unused GENewImpactD3Enabled flag]] [13:27:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10Jclark-ctr) gerrit1003 B5 U13 port 4 Cableid 2988 [13:27:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10Jclark-ctr) [13:28:35] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:894593|GrowthExperiments: Remove unused GENewImpactD3Enabled flag]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:28:36] 10SRE, 10SRE Observability, 10User-fgiunchedi: Webrequest Sampled Live on Superset shows data from only upload and not text CDN nodes - https://phabricator.wikimedia.org/T331801 (10fgiunchedi) [13:28:58] (03PS1) 10Jaime Nuche: docker::gc: update configuration to use latest version of images [puppet] - 10https://gerrit.wikimedia.org/r/900312 [13:29:00] (03PS1) 10Jaime Nuche: deployment_server: clean up older images using systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) [13:30:31] (03CR) 10David Caro: [C: 03+1] "Note that this will need changing the secrets in the 
puppetmasters to adapt to the new hiera key" [puppet] - 10https://gerrit.wikimedia.org/r/899724 (https://phabricator.wikimedia.org/T316323) (owner: 10FNegri) [13:32:11] (03CR) 10David Caro: [C: 03+1] openstack: create openstack-ansible evaluation role [puppet] - 10https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) (owner: 10Arturo Borrero Gonzalez) [13:33:31] (03Abandoned) 10Arturo Borrero Gonzalez: openstack: create openstack-ansible evaluation role [puppet] - 10https://gerrit.wikimedia.org/r/895789 (https://phabricator.wikimedia.org/T326758) (owner: 10Arturo Borrero Gonzalez) [13:34:45] !log kharlan@deploy2002 Finished scap: Backport for [[gerrit:894593|GrowthExperiments: Remove unused GENewImpactD3Enabled flag]] (duration: 07m 44s) [13:35:04] !log UTC afternoon deploys done [13:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:17] (03PS1) 10Jbond: netbox: fix minor lint issues and add test [puppet] - 10https://gerrit.wikimedia.org/r/900315 [13:35:19] (03PS1) 10Jbond: Netbox: introduce support for validators [puppet] - 10https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) [13:35:21] (03PS1) 10Jbond: netbox: add validators to canary host [puppet] - 10https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590) [13:35:23] (03PS1) 10Jbond: netbox: add validators to production host [puppet] - 10https://gerrit.wikimedia.org/r/900318 (https://phabricator.wikimedia.org/T310590) [13:35:43] (03CR) 10Jbond: "i think it would be better to control this via puppet so that we can for instance have different (new) validators on netbox-next vs netbox" [puppet] - 10https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi) [13:36:11] (03CR) 10CI reject: [V: 04-1] netbox: fix minor lint issues and add test [puppet] - 10https://gerrit.wikimedia.org/r/900315 (owner: 10Jbond) [13:36:18] (03CR) 10CI reject: [V: 04-1] Netbox: 
introduce support for validators [puppet] - 10https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond) [13:38:04] (03CR) 10Jbond: [C: 03+2] wmflib::service::probe::module_options: add tests [puppet] - 10https://gerrit.wikimedia.org/r/899643 (owner: 10Jbond) [13:43:58] (03PS5) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) [13:45:06] (03CR) 10Ayounsi: [C: 03+1] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney) [13:46:47] (03CR) 10Cathal Mooney: Adjust BFD Icinga check to handle SNMP connection failure (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney) [13:46:49] (03CR) 10Cathal Mooney: [C: 03+2] Adjust BFD Icinga check to handle SNMP connection failure [puppet] - 10https://gerrit.wikimedia.org/r/898849 (https://phabricator.wikimedia.org/T332080) (owner: 10Cathal Mooney) [13:51:28] (03CR) 10Filippo Giunchedi: [C: 03+1] wmcs: update ceph alerts dashboard [alerts] - 10https://gerrit.wikimedia.org/r/895220 (owner: 10David Caro) [13:51:58] (03PS3) 10Elukey: admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240 [13:52:05] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Patch-For-Review: BFD Status Check Fails when device is unavailable - https://phabricator.wikimedia.org/T332080 (10cmooney) 05Open→03Resolved a:03cmooney [13:55:07] (03CR) 10CI reject: [V: 04-1] admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/900240 (owner: 10Elukey) [13:55:29] (03CR) 10Ayounsi: "Overall LGTM." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: 10Jbond) [13:55:46] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Patch-For-Review: SNMP Network Checks throw exception when device is unreachable - https://phabricator.wikimedia.org/T332080 (10cmooney) [13:56:12] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability, 10Patch-For-Review: SNMP Network Checks throw exception when device is unreachable - https://phabricator.wikimedia.org/T332080 (10cmooney) 05Resolved→03Open [13:56:36] (03CR) 10Cathal Mooney: Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T315053) (owner: 10Cathal Mooney) [13:56:39] (03PS1) 10Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it [puppet] - 10https://gerrit.wikimedia.org/r/900323 [13:56:48] (03PS3) 10Cathal Mooney: Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured [puppet] - 10https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T315053) [13:56:58] (03PS1) 10Ssingh: auditd: remove obsolete buster code [puppet] - 10https://gerrit.wikimedia.org/r/900324 (https://phabricator.wikimedia.org/T321309) [13:57:56] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40168/console" [puppet] - 10https://gerrit.wikimedia.org/r/900324 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:58:54] (03CR) 10CI reject: [V: 04-1] rbd2backy2: log 'expire' string before trying to parse it [puppet] - 10https://gerrit.wikimedia.org/r/900323 (owner: 10Andrew Bogott) [13:59:15] (03CR) 10Cathal Mooney: [C: 03+2] Adjust OSPF Icinga check to ignore OSPFv3 if zero ints configured [puppet] - 10https://gerrit.wikimedia.org/r/899609 (https://phabricator.wikimedia.org/T315053) (owner: 10Cathal Mooney) [13:59:35] (03CR) 10Ssingh: [V: 03+1 
C: +2] auditd: remove obsolete buster code [puppet] - https://gerrit.wikimedia.org/r/900324 (https://phabricator.wikimedia.org/T321309) (owner: Ssingh)
[14:01:03] SRE-swift-storage, Data-Engineering-Planning, Event-Platform Value Stream (Sprint 10): Storage request: swift s3 bucket for mediawiki-page-content-change-enrichment checkpointing - https://phabricator.wikimedia.org/T330693 (Eevans) >>! In T330693#8701662, @gmodena wrote: >>> [ ... ] >>> >>> How woul...
[14:01:27] (PS2) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323
[14:06:05] !log ALTER-ing image_suggestions.suggestion table — T328670
[14:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:11] T328670: Add section title column to image_suggestions.suggestions table schema - https://phabricator.wikimedia.org/T328670
[14:07:36] (CR) David Caro: rbd2backy2: log 'expire' string before trying to parse it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900323 (owner: Andrew Bogott)
[14:08:01] (PS3) Ssingh: dnsrecursor: drop support for buster and pdns-recursor < 4.6 [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083)
[14:09:50] (CR) Ssingh: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40169/console" [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: Ssingh)
[14:13:21] (CR) CDanis: [C: +1] Create and deploy per-CDN-site DNS domains [dns] - https://gerrit.wikimedia.org/r/899214 (owner: Jameel Kaisar)
[14:13:59] (CR) Ssingh: [V: +1] dnsrecursor: drop support for buster and pdns-recursor < 4.6 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: Ssingh)
[14:17:23] (PS4) Elukey: admin_ng: limit the experimental namespace to
ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900240
[14:20:40] (CR) Raymond Ndibe: [C: +1] "LGTM" [software/tools-webservice] - https://gerrit.wikimedia.org/r/897830 (owner: Slavina Stefanova)
[14:20:59] (CR) Raymond Ndibe: [C: +2] d/changelog: prepare for 0.92 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/897830 (owner: Slavina Stefanova)
[14:21:22] (PS8) Ayounsi: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: Cathal Mooney)
[14:22:36] (Merged) jenkins-bot: d/changelog: prepare for 0.92 release [software/tools-webservice] - https://gerrit.wikimedia.org/r/897830 (owner: Slavina Stefanova)
[14:22:37] (CR) Ayounsi: [C: +1] "ship it!" [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: Cathal Mooney)
[14:26:24] (CR) Elukey: [C: +2] admin_ng: limit the experimental namespace to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900240 (owner: Elukey)
[14:28:03] (PS2) Jbond: netbox: fix minor lint issues and add test [puppet] - https://gerrit.wikimedia.org/r/900315
[14:30:31] (PS1) Esanders: Enable DiscussionTools_visualenhancements_newsectionlink_enable on labs for testing [mediawiki-config] - https://gerrit.wikimedia.org/r/900331
[14:31:12] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[14:31:16] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:37:08] (CR) Jbond: [C: +2] netbox: fix minor lint issues and add test [puppet] - https://gerrit.wikimedia.org/r/900315 (owner: Jbond)
[14:37:26] (PS2) Jbond: Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590)
[14:38:01] (CR) CI reject: [V: -1] Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[14:39:54] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Urgent: Two failed disks in ms-be2040 - https://phabricator.wikimedia.org/T331860 (MatthewVernon) Open→Resolved @Papaul thanks; the other drive has behaved itself since the reboot, so I think we're OK to leave it in place for now. [obviously it...
[14:40:44] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[14:40:48] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:40:50] !log gmodena@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply
[14:40:57] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[14:42:26] (CR) Ayounsi: Netbox: activate validators (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[14:44:35] !log elukey@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:47:48] (PS3) Jbond: Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590)
[14:49:19] (PS1) Herron: kafka-logging: stop kafka services on kafka-logging1001 [puppet] - https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419)
[14:49:22] (PS1) Herron: kafka-logging: bring up kafka-logging1004 with node id 1004 [puppet] - https://gerrit.wikimedia.org/r/900337 (https://phabricator.wikimedia.org/T326419)
[14:49:38] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: new_install
[14:50:01] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: new_install
[14:50:06] SRE, serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=33992616-b446-4bc5-bf17-27cb8c47e8d7) set by cgoubert@cumin1001 for 1:00:00 on 32 host(s) and their services with reason: new_instal...
[14:50:18] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) @cmooney thank you for getting the table ready for the cloud nodes move. As you can see on asw...
[14:50:49] (CR) Ahmon Dancy: "This looks like a reasonable and simple way to go."
[puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[14:51:19] (CR) Ayounsi: Add validator classes for some objects (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[14:51:35] (PS6) Ayounsi: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[14:51:59] SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: SNMP Network Checks throw exception when device is unreachable - https://phabricator.wikimedia.org/T332080 (cmooney) Looks like our OSPF check already handles any exceptions: ` cmooney@alert1001:~$ ./check_ospf.py --hos...
[14:54:07] (CR) Ahmon Dancy: "This change is ready for review." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[14:55:39] (CR) Herron: "the new kafka-logging hosts are a good opportunity to re-align hostnames with node ids where they overlap.
my high level plan is to set d" [puppet] - https://gerrit.wikimedia.org/r/900336 (https://phabricator.wikimedia.org/T326419) (owner: Herron)
[14:57:06] (PS6) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[14:57:08] (PS1) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340
[14:57:33] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:38] SRE, Observability-Logging, Release-Engineering-Team, Wikimedia-Logstash, SRE Observability (FY2022/2023-Q3): mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (colewhite) The `MediaWiki errors over time by channel` visualization...
[14:59:19] (CR) CI reject: [V: -1] SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340 (owner: Jbond)
[14:59:30] (CR) CI reject: [V: -1] sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:02:00] (CR) Jaime Nuche: deployment_server: clean up older images using systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[15:02:12] (CR) Ahmon Dancy: deployment_server: clean up older images using systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[15:02:51] (CR) Jbond: "thanks updated see inline re ensurable_shell" [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:05:07] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:05:45] (PS2) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340
[15:05:47] (PS7) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:05:58] (PS5) JMeybohm: k8s: Remove 1.16 related code [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291)
[15:07:23] (PS3) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340
[15:08:09] (PS2) Jbond: netbox: add validators to canary host [puppet] - https://gerrit.wikimedia.org/r/900317
(https://phabricator.wikimedia.org/T310590)
[15:08:27] (PS2) Jbond: netbox: add validators to production host [puppet] - https://gerrit.wikimedia.org/r/900318 (https://phabricator.wikimedia.org/T310590)
[15:09:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.remove-downtime for 32 hosts
[15:09:27] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40171/console" [puppet] - https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:09:35] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 32 hosts
[15:10:06] (CR) CDanis: [C: +1] Create and deploy per-CDN-site DNS domains (1 comment) [dns] - https://gerrit.wikimedia.org/r/899214 (owner: Jameel Kaisar)
[15:10:09] !log disable puppet on R:class = dnsrecursor to merge CR: 898957
[15:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:13] (CR) JMeybohm: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40170/console" [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:10:19] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40172/console" [puppet] - https://gerrit.wikimedia.org/r/900318 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:11:36] !log cgoubert@cumin1001 conftool action : set/weight=30; selector: name=mw24[2345].*.codfw.wmnet,cluster=appserver
[15:11:55] (CR) Ssingh: [V: +1 C: +2] dnsrecursor: drop support for buster and pdns-recursor < 4.6 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/898957 (https://phabricator.wikimedia.org/T332083) (owner: Ssingh)
[15:11:59] !log cgoubert@cumin1001 conftool action : set/weight=30; selector:
name=mw24[2345].*.codfw.wmnet,cluster=api_appserver
[15:12:22] jbond: ok to merge yours?
[15:12:27] Jbond: netbox: fix minor lint issues and add test (7b3e6603b3)
[15:12:38] !log cgoubert@cumin1001 conftool action : set/weight=25; selector: name=mw24[2345].*.codfw.wmnet,cluster=jobrunner
[15:13:14] !log cgoubert@cumin1001 conftool action : set/weight=25; selector: name=mw24[2345].*.codfw.wmnet,cluster=videoscaler
[15:13:43] jouncebot: nowandnext
[15:13:43] No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[15:13:43] In 0 hour(s) and 46 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1600)
[15:14:11] (PS8) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:14:16] (CR) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:14:36] (PS9) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:15:15] !log Pooling new mw hosts mw24[20-51].codfw.wmnet - T326363
[15:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:20] T326363: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363
[15:15:50] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw24[2345].*.codfw.wmnet,cluster=appserver
[15:16:47] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:17:43] (CR) Volans: [C: +1] "LGTM although I'm not sure if we want to enable this behaviour."
[cookbooks] - https://gerrit.wikimedia.org/r/900340 (owner: Jbond)
[15:18:37] PROBLEM - mediawiki-installation DSH group on mw2422 is CRITICAL: Host mw2422 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:37] PROBLEM - mediawiki-installation DSH group on mw2423 is CRITICAL: Host mw2423 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:37] PROBLEM - mediawiki-installation DSH group on mw2424 is CRITICAL: Host mw2424 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:37] PROBLEM - mediawiki-installation DSH group on mw2426 is CRITICAL: Host mw2426 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:37] PROBLEM - mediawiki-installation DSH group on mw2427 is CRITICAL: Host mw2427 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:38] PROBLEM - mediawiki-installation DSH group on mw2428 is CRITICAL: Host mw2428 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:38] PROBLEM - mediawiki-installation DSH group on mw2429 is CRITICAL: Host mw2429 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:39] PROBLEM - mediawiki-installation DSH group on mw2430 is CRITICAL: Host mw2430 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:39] PROBLEM - mediawiki-installation DSH group on mw2434 is CRITICAL: Host mw2434 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:40] PROBLEM - mediawiki-installation DSH group on mw2435 is CRITICAL: Host mw2435 is not in mediawiki-installation dsh group
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:40] PROBLEM - mediawiki-installation DSH group on mw2436 is CRITICAL: Host mw2436 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:41] PROBLEM - mediawiki-installation DSH group on mw2437 is CRITICAL: Host mw2437 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:41] PROBLEM - mediawiki-installation DSH group on mw2440 is CRITICAL: Host mw2440 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:42] PROBLEM - mediawiki-installation DSH group on mw2442 is CRITICAL: Host mw2442 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:42] PROBLEM - mediawiki-installation DSH group on mw2443 is CRITICAL: Host mw2443 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:42] Expected.
[15:18:43] PROBLEM - mediawiki-installation DSH group on mw2444 is CRITICAL: Host mw2444 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:43] PROBLEM - mediawiki-installation DSH group on mw2445 is CRITICAL: Host mw2445 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:44] PROBLEM - mediawiki-installation DSH group on mw2446 is CRITICAL: Host mw2446 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:44] PROBLEM - mediawiki-installation DSH group on mw2450 is CRITICAL: Host mw2450 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:45] PROBLEM - mediawiki-installation DSH group on mw2451 is CRITICAL: Host mw2451 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:18:46] claime: phew :P
[15:19:50] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw24[2345].*.codfw.wmnet,cluster=api_appserver
[15:20:17] SRE, Infrastructure-Foundations: Migrate PKI servers to Bullseye - https://phabricator.wikimedia.org/T331696 (Jhancock.wm)
[15:20:51] ops-codfw, DC-Ops, Infrastructure-Foundations, decommission-hardware, Patch-For-Review: decommission pki2001.codfw.wmnet - https://phabricator.wikimedia.org/T332018 (Jhancock.wm) Open→Resolved
[15:22:57] RECOVERY - mediawiki-installation DSH group on mw2422 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:57] RECOVERY - mediawiki-installation DSH group on mw2423 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:57] RECOVERY - mediawiki-installation DSH group on mw2424 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] RECOVERY - mediawiki-installation DSH group on mw2434 is OK: OK
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] RECOVERY - mediawiki-installation DSH group on mw2435 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] RECOVERY - mediawiki-installation DSH group on mw2436 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] RECOVERY - mediawiki-installation DSH group on mw2437 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] RECOVERY - mediawiki-installation DSH group on mw2440 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:22:59] RECOVERY - mediawiki-installation DSH group on mw2442 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:23:00] RECOVERY - mediawiki-installation DSH group on mw2443 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:23:00] RECOVERY - mediawiki-installation DSH group on mw2450 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:23:01] RECOVERY - mediawiki-installation DSH group on mw2451 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:23:14] There's gonna be a bit more flooding and then I'll be done :D
[15:23:34] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw24[2345].*.codfw.wmnet,cluster=jobrunner
[15:23:45] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: name=mw24[2345].*.codfw.wmnet,cluster=videoscaler
[15:25:45] (PS2) Jaime Nuche: docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900312
[15:25:47] (PS3) Jaime Nuche: deployment_server: clean up older images using systemd timer [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678)
[15:25:49] (PS1) Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - https://gerrit.wikimedia.org/r/900353
[15:25:52] (CR)
Nicolas Fraison: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[15:26:11] SRE, sre-unowned: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (MoritzMuehlenhoff)
[15:26:17] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:23] SRE, sre-unowned: Migrate Kafka test cluster to Bullseye - https://phabricator.wikimedia.org/T331710 (MoritzMuehlenhoff)
[15:27:12] (CR) Jaime Nuche: docker::gc: update configuration to use latest version of images (2 comments) [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[15:27:53] (PS1) Ayounsi: Test, add validator directory [software/netbox-extras] - https://gerrit.wikimedia.org/r/900354
[15:28:05] (PS3) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323
[15:28:07] (CR) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900323 (owner: Andrew Bogott)
[15:28:12] !log enable puppet on R:class = dnsrecursor to merge CR: 898957 [done]
[15:28:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:18] (CR) Ayounsi: [C: +2] Test, add validator directory [software/netbox-extras] - https://gerrit.wikimedia.org/r/900354 (owner: Ayounsi)
[15:30:20] (CR) Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/900313/40167/" [puppet] - https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[15:30:23] (CR) Nicolas Fraison: spark-operator: enable spark operator mutation webhook (1 comment)
[deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[15:32:03] (CR) Jbond: Netbox: activate validators (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[15:32:03] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:17] RECOVERY - mediawiki-installation DSH group on mw2426 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:17] RECOVERY - mediawiki-installation DSH group on mw2427 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:17] RECOVERY - mediawiki-installation DSH group on mw2428 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:17] RECOVERY - mediawiki-installation DSH group on mw2429 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:17] RECOVERY - mediawiki-installation DSH group on mw2430 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:18] RECOVERY - mediawiki-installation DSH group on mw2444 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:18] RECOVERY - mediawiki-installation DSH group on mw2445 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:32:19] RECOVERY - mediawiki-installation DSH group on mw2446 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups
[15:33:18] sukhe: There, no more flooding :D
[15:34:14] claime: thanks!
[15:36:56] SRE, Traffic: Clean up and refactor the dnsrecursor module - https://phabricator.wikimedia.org/T332083 (ssingh) Open→Resolved a:ssingh This has been resolved with the https://gerrit.wikimedia.org/r/898957 and all `R:Class = dnsrecursor` hosts running bullseye: ` sukhe@cumin2002:~$ sudo cumin...
[15:37:02] (CR) Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/900312/40173/" [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[15:37:49] (CR) Ayounsi: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:39:50] !log Pooled new mw hosts mw24[20-51].codfw.wmnet - T326363
[15:39:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:55] T326363: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363
[15:40:04] (CR) JMeybohm: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[15:41:06] (PS1) Cathal Mooney: Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080)
[15:41:45] (CR) CI reject: [V: -1] Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[15:43:22] (CR) Ayounsi: [C: +1] Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:43:57] (CR) Ayounsi: Netbox: activate validators (1 comment) [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590)
(owner: Ayounsi)
[15:44:21] (Abandoned) Ayounsi: Netbox: activate validators [puppet] - https://gerrit.wikimedia.org/r/889959 (https://phabricator.wikimedia.org/T310590) (owner: Ayounsi)
[15:44:31] (CR) David Caro: [C: +2] wmcs: update ceph alerts dashboard [alerts] - https://gerrit.wikimedia.org/r/895220 (owner: David Caro)
[15:44:34] (PS3) David Caro: wmcs: update ceph alerts dashboard [alerts] - https://gerrit.wikimedia.org/r/895220
[15:46:47] (PS2) Cathal Mooney: Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080)
[15:46:59] SRE, serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (Clement_Goubert) All done ` {"mw2422.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "tags": "dc=codfw,cluster=api_appserver,service=nginx"} {"mw2423.codfw.wmnet": {"weight": 30, "pooled": "yes"}, "...
[15:47:06] (PS10) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:47:25] (CR) CI reject: [V: -1] Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[15:47:27] (CR) Nicolas Fraison: spark-operator: enable spark operator mutation webhook (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison)
[15:47:38] (Abandoned) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell [cookbooks] - https://gerrit.wikimedia.org/r/900340 (owner: Jbond)
[15:48:07] (CR) Jbond: SREBatchBase: allow cookbooks to opt out of ensurable shell (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/900340 (owner: Jbond)
[15:48:10]
(CR) Jaime Nuche: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/900353 (owner: Jaime Nuche)
[15:48:14] (PS11) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:49:25] (CR) Ayounsi: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:49:49] (PS3) Cathal Mooney: Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080)
[15:50:26] (CR) CI reject: [V: -1] Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney)
[15:50:32] (PS12) Nicolas Fraison: spark-operator: enable spark operator mutation webhook [deployment-charts] - https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858)
[15:50:34] (PS5) Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859)
[15:51:35] (CR) Ahmon Dancy: deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (owner: Jaime Nuche)
[15:51:41] (CR) JMeybohm: "The update to 1.23 accidentally enabled the IPv6DualStack feature gate in kubelet even for clusters with "profile::kubernetes::ipv6dualsta" [puppet] - https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: JMeybohm)
[15:53:11] (CR) Ahmon Dancy: [C: +1] deployment_server: clean up older images using systemd timer [puppet] - https://gerrit.wikimedia.org/r/900313
(https://phabricator.wikimedia.org/T329678) (owner: Jaime Nuche)
[15:53:25] (PS4) Cathal Mooney: Modify netops Icinga checks to gracefully deal with SNMP timeout [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080)
[15:54:21] (PS12) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:54:42] (CR) Jaime Nuche: deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (owner: Jaime Nuche)
[15:55:10] (CR) Ahmon Dancy: [C: +1] docker::gc: update configuration to use latest version of images [puppet] - https://gerrit.wikimedia.org/r/900312 (owner: Jaime Nuche)
[15:55:45] (CR) Ahmon Dancy: [C: +1] deployment_server: ensure Docker is installed (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900353 (owner: Jaime Nuche)
[15:56:38] (CR) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond)
[15:57:00] (PS13) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590)
[15:57:33] (PS5) Jameel Kaisar: Create and deploy per-CDN-site DNS domains [dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025)
[15:59:38] (PS2) Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622)
[16:00:04] jbond and rzl: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1600).
[16:00:04] No Gerrit patches in the queue for this window AFAICS. [16:02:53] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:03:15] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8702610, @Papaul wrote: > I know the first 4 ports on cloudsw1-b1 are set up as... [16:04:53] RECOVERY - cinder-volume process on cloudcontrol1007 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-volume https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [16:06:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:07:36] (CR) Volans: [C: +1] "LGTM" [dns] - https://gerrit.wikimedia.org/r/899214 (https://phabricator.wikimedia.org/T332025) (owner: Jameel Kaisar) [16:09:54] (PS14) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) [16:11:38] (PS1) Btullis: Upgrade Airflow on the platform_eng instance and switch to PostgreSQL [puppet] - https://gerrit.wikimedia.org/r/900366 (https://phabricator.wikimedia.org/T326193) [16:11:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:13:01] (PS15) Jbond:
sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) [16:13:52] (PS1) EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) [16:14:16] (PS1) Jbond: Revert "Test, add validator directory" [software/netbox-extras] - https://gerrit.wikimedia.org/r/900138 [16:14:19] (CR) CI reject: [V: -1] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [16:14:55] (CR) Jbond: [V: +2 C: +2] Revert "Test, add validator directory" [software/netbox-extras] - https://gerrit.wikimedia.org/r/900138 (owner: Jbond) [16:15:22] (PS2) EoghanGaffney: Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) [16:16:20] (PS1) Jbond: test validators with multiple files [software/netbox-extras] - https://gerrit.wikimedia.org/r/900371 [16:16:37] (CR) Jbond: [V: +2 C: +2] test validators with multiple files [software/netbox-extras] - https://gerrit.wikimedia.org/r/900371 (owner: Jbond) [16:21:04] (PS16) Jbond: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) [16:21:33] (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40175/console" [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [16:21:42] SRE, LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482
(FNavas-foundation) Hi @MatthewVernon, thanks for picking this up - I do need Turnilo access. I also need access to a special dashboard created by @Pablo. Sorry about... [16:21:44] (PS1) Jbond: Revert "test validators with multiple files" [software/netbox-extras] - https://gerrit.wikimedia.org/r/900139 [16:21:49] (CR) Jbond: [V: +2 C: +2] Revert "test validators with multiple files" [software/netbox-extras] - https://gerrit.wikimedia.org/r/900139 (owner: Jbond) [16:22:40] (PS1) EoghanGaffney: Adds php and apache logs for doc machines [puppet] - https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) [16:22:55] (CR) Jbond: "ready for review: tested and lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond) [16:23:30] (CR) Jbond: [C: +2] Netbox: introduce support for validators [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond) [16:24:53] (CR) Jbond: [V: +1 C: +2] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40176/console" [puppet] - https://gerrit.wikimedia.org/r/900316 (https://phabricator.wikimedia.org/T310590) (owner: Jbond) [16:25:58] (CR) JHathaway: [C: +2] exim: remove wikimedia.com from wikimedia_domains [puppet] - https://gerrit.wikimedia.org/r/898896 (https://phabricator.wikimedia.org/T331676) (owner: Dzahn) [16:29:17] (PS4) Andrew Bogott: rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323 [16:29:19] (PS1) Andrew Bogott: Trove: increase volume formats a whole lot [puppet] - https://gerrit.wikimedia.org/r/900379 [16:30:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [16:30:59] (CR) Jameel Kaisar: "Done" [dns] - https://gerrit.wikimedia.org/r/899214
(https://phabricator.wikimedia.org/T332025) (owner: Jameel Kaisar) [16:31:11] !log reboot ms-be2067 again to see if the missing drive comes back [16:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:49] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:57] (CR) Andrew Bogott: [C: +2] rbd2backy2: log 'expire' string before trying to parse it [puppet] - https://gerrit.wikimedia.org/r/900323 (owner: Andrew Bogott) [16:32:06] (CR) Andrew Bogott: [C: +2] Trove: increase volume formats a whole lot [puppet] - https://gerrit.wikimedia.org/r/900379 (owner: Andrew Bogott) [16:32:10] (PS1) Elukey: ml-services: limit deployments of experimental to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900381 [16:35:20] (CR) CI reject: [V: -1] ml-services: limit deployments of experimental to ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/900381 (owner: Elukey) [16:36:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:37:29] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:45:57] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (MatthewVernon) @Papaul that's the one - can you clear the Foreign state from that disk, please?
I can't figure out how to do it, and I think without that config being cleared it's... [16:46:03] PROBLEM - pybal on lvs4010 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:46:15] ^ expected, rebooting [16:46:23] PROBLEM - PyBal backends health check on lvs4010 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:46:33] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:46:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on lvs4010.ulsfo.wmnet with reason: rebooting for kernel updates [16:47:11] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs4010.ulsfo.wmnet with reason: rebooting for kernel updates [16:47:13] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:50:02] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) cloudservices2004-dev = U37 cloudservices2005-dev = U38 cloudweb2002-dev = 39 yes we already o...
[16:53:03] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:53:47] RECOVERY - pybal on lvs4010 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:54:07] RECOVERY - PyBal backends health check on lvs4010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:54:17] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:05] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs4010.ulsfo.wmnet [16:56:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs4010.ulsfo.wmnet [16:56:32] SRE, LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (Aklapper) > I made another user with my WMF mediawiki account (FNavas-foundation), which is linked to my foundation email. On a related note, in staff capacity I'd rec...
[16:56:52] (CR) Bking: [V: +1 C: +2] wdqs: export more jmx metrics to prometheus [puppet] - https://gerrit.wikimedia.org/r/898687 (https://phabricator.wikimedia.org/T331405) (owner: DCausse) [16:56:53] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:57:06] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 218k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [16:58:46] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@e17ee96]: First deploy after Airflow 2.5.1 upgrade. [16:59:11] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@e17ee96]: First deploy after Airflow 2.5.1 upgrade.
(duration: 00m 24s) [17:01:19] (CR) Btullis: [C: +2] Upgrade Airflow on the platform_eng instance and switch to PostgreSQL [puppet] - https://gerrit.wikimedia.org/r/900366 (https://phabricator.wikimedia.org/T326193) (owner: Btullis) [17:02:29] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:41] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:15:00 on lvs4008.ulsfo.wmnet with reason: rebooting for kernel updates [17:05:56] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on lvs4008.ulsfo.wmnet with reason: rebooting for kernel updates [17:06:23] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:09:45] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:10:11] expected [17:10:25] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:11:34] (PS1) Hnowlan: thumbor: bump workers, reduce CPU, increase queue [deployment-charts] - https://gerrit.wikimedia.org/r/900388 (https://phabricator.wikimedia.org/T328033) [17:12:03] (PS1) Cmjohnson: Adding ms-fe1013-4 and thanos-fe1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/898922 (https://phabricator.wikimedia.org/T326846) [17:12:06] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig [17:12:09] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:13:14] (CR) Cmjohnson: [C: +2] Adding ms-fe1013-4 and thanos-fe1004 to site.pp [puppet] - https://gerrit.wikimedia.org/r/898922 (https://phabricator.wikimedia.org/T326846) (owner: Cmjohnson) [17:17:23] (CR) FNegri: [C: +2] [tbs.harbor] Clean up admin pwd management [puppet] - https://gerrit.wikimedia.org/r/899724 (https://phabricator.wikimedia.org/T316323) (owner: FNegri) [17:19:40] SRE, Infrastructure-Foundations, netops, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) Ok cool. So I'd propose we take it like this: **1. Move sretest2001 from port xe-0/0/1 to xe... [17:19:51] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host ms-fe1013.eqiad.wmnet with OS bullseye [17:21:50] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Patch-For-Review: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host ms-fe1013.eqiad.wmnet with...
[17:21:51] PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:22:06] ^ expected [17:22:17] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:22:33] PROBLEM - PyBal connections to etcd on lvs4008 is CRITICAL: CRITICAL: 0 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:22:42] (PS1) Sharvaniharan: Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900396 [17:22:49] (CR) CI reject: [V: -1] Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900396 (owner: Sharvaniharan) [17:26:09] PROBLEM - PyBal backends health check on lvs4008 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [17:27:39] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:27:41] PROBLEM - pybal on lvs4008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:29:03] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:29:37] RECOVERY - pybal on lvs4008 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [17:29:45] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:30:03] RECOVERY - PyBal backends health check on
lvs4008 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:30:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [17:30:57] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye [17:31:42] (PS1) Sharvaniharan: Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900399 [17:31:52] (Abandoned) Sharvaniharan: Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900396 (owner: Sharvaniharan) [17:32:51] (CR) Ayounsi: [C: +2] "Awesome!" [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond) [17:32:53] (PS1) FNegri: [tbs.harbor] Remove duplicate pwd [puppet] - https://gerrit.wikimedia.org/r/900400 (https://phabricator.wikimedia.org/T316323) [17:33:27] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:34:23] RECOVERY - PyBal connections to etcd on lvs4008 is OK: OK: 12 connections established with conf2006.codfw.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal [17:35:01] (Merged) jenkins-bot: sre.netbox.deploy-extras: create a cookbook to deploy netbox-extras [cookbooks] - https://gerrit.wikimedia.org/r/900242 (https://phabricator.wikimedia.org/T310590) (owner: Jbond) [17:35:04] (PS1) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 [17:36:22] !log cmjohnson@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host
thanos-fe1004.eqiad.wmnet with OS bullseye [17:36:28] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed w... [17:37:08] (CR) CI reject: [V: -1] harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro) [17:37:10] (PS1) Btullis: Remove the overridden configuration for airflow-platform_eng [puppet] - https://gerrit.wikimedia.org/r/900403 (https://phabricator.wikimedia.org/T326193) [17:38:41] (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40177/console" [puppet] - https://gerrit.wikimedia.org/r/900403 (https://phabricator.wikimedia.org/T326193) (owner: Btullis) [17:39:18] (CR) Btullis: [V: +1 C: +2] Remove the overridden configuration for airflow-platform_eng [puppet] - https://gerrit.wikimedia.org/r/900403 (https://phabricator.wikimedia.org/T326193) (owner: Btullis) [17:40:09] (PS2) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 [17:40:11] !log ayounsi@cumin2002 START - Cookbook sre.netbox.update-extras rolling update on A:netbox-canary [17:40:12] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling update on A:netbox-canary [17:40:46] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [17:40:52] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS
bullseye [17:41:15] (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590) (owner: Jbond) [17:41:17] (PS3) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 [17:41:29] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:25:00 on lvs4009.ulsfo.wmnet with reason: rebooting for kernel updates [17:41:38] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on lvs4009.ulsfo.wmnet with reason: rebooting for kernel updates [17:43:33] (CR) David Caro: "This is currently deployed in toolsbeta successfully \o/" [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro) [17:43:38] (CR) David Caro: [V: +1] harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro) [17:44:32] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (Cmjohnson) [17:45:03] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:43] (PS1) Jbond: sre.hosts.reboot-single: args.depool not args.pool [cookbooks] - https://gerrit.wikimedia.org/r/900405 [17:46:16] (PS3) Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) [17:46:25] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:47:03] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:47:24] expected ^ [17:48:51] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:50:31] (CR) Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/900353/40178/" [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche) [17:52:09] (CR) Ayounsi: [C: +2] "I manually cherry-picked the netbox-extra patch on netbox-dev as well as checked that the symlink is still there." [puppet] - https://gerrit.wikimedia.org/r/900317 (https://phabricator.wikimedia.org/T310590) (owner: Jbond) [17:52:34] (CR) Sharvaniharan: "Hi! Please review when you get a chance :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/900399 (owner: Sharvaniharan) [17:54:06] (CR) Ahmon Dancy: [C: +1] deployment_server: ensure Docker is installed [puppet] - https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: Jaime Nuche) [17:54:10] (PS1) David Caro: k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408 [17:54:26] (PS2) David Caro: k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408 [17:56:27] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:18] (CR) Ayounsi: [C: +1] Modify netops Icinga checks to gracefully deal with SNMP timeout (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney) [18:01:49] (PS1) EoghanGaffney: Add doc host
apache/php-fpm logs to kafka [puppet] - https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245) [18:03:34] !log sukhe@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs4009.ulsfo.wmnet [18:03:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs4009.ulsfo.wmnet [18:05:30] jouncebot: nowandnext [18:05:30] For the next 1 hour(s) and 54 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1800) [18:05:30] In 1 hour(s) and 54 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T2000) [18:05:43] (PS3) David Caro: k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408 [18:13:52] (CR) FNegri: harbor: don't use https, use web proxies instead (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro) [18:15:56] (PS4) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 [18:17:10] (PS5) David Caro: harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 [18:17:25] (CR) David Caro: harbor: don't use https, use web proxies instead (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro) [18:17:50] (CR) FNegri: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro) [18:18:06] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 91, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:19:54] (CR) David Caro: [C: +2] harbor: don't use https, use web proxies instead [puppet] - https://gerrit.wikimedia.org/r/900401 (owner: David Caro) [18:19:56] PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:43] ^ known?
[18:22:46] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:18] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 107, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:26:04] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:32] RECOVERY - Host ms-be2067 is UP: PING OK - Packet loss = 0%, RTA = 31.58 ms [18:27:13] (CR) Dbrant: [C: +1] Remove sampling from breadCrumbs schema [mediawiki-config] - https://gerrit.wikimedia.org/r/900399 (owner: Sharvaniharan) [18:28:20] (CR) Cathal Mooney: [C: +2] Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: Cathal Mooney) [18:28:52] (Merged) jenkins-bot: Add automation for EVPN BGP peerings [homer/public] - https://gerrit.wikimedia.org/r/894741 (https://phabricator.wikimedia.org/T327934) (owner: Cathal Mooney) [18:29:02] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T331030 (Papaul) Open→Resolved Physical Disk 0:2:19 Online 19 7451.5 GB SATA HDD No [18:29:23] SRE, Domains, Traffic-Icebox: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (BCornwall) Hi, @CRoslof! Have you been able to look into these registrations? Thanks!
[18:29:38] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service,session-c624.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:31:24] PROBLEM - Check systemd state on ms-be2067 is CRITICAL: CRITICAL - degraded: The following units failed: srv-swift\x2dstorage-sdv1.mount https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:18] (CR) Cathal Mooney: [C: +2] Modify netops Icinga checks to gracefully deal with SNMP timeout (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900360 (https://phabricator.wikimedia.org/T332080) (owner: Cathal Mooney) [18:37:04] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-fe1004.eqiad.wmnet with OS bullseye [18:37:10] SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye executed w...
[18:38:25] (CR) FNegri: [C: +1] k8s: update harbor fqdn to the proxy url [software/tools-webservice] - https://gerrit.wikimedia.org/r/900408 (owner: David Caro) [18:38:27] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2067.codfw.wmnet [18:40:45] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@5c2c701]: (no justification provided) [18:40:58] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@5c2c701]: (no justification provided) (duration: 00m 13s) [18:41:28] SRE, Traffic, VPS-project-Codesearch, serviceops, Patch-For-Review: Consider using BindsTo instead of Requires to declare dependencies between systemd unit - https://phabricator.wikimedia.org/T284555 (BCornwall) [18:44:00] SRE, Infrastructure-Foundations, netops: Automate EVPN switch underlay BGP neighbor peerings - https://phabricator.wikimedia.org/T327934 (cmooney) Open→Resolved Changes merged and feature now fully controlled from automation / homer. [18:49:22] SRE, ops-eqiad, SRE Observability (FY2022/2023-Q3): Decommission centrallog1001 - https://phabricator.wikimedia.org/T328803 (cmooney) @andrea.denisse just a heads up we got an alarm on our core routers in Eqiad for a BFD/BGP session down. Seems this server was configured to BGP peer with the CRs? `...
[18:51:39] (PS1) Cathal Mooney: Remove BGP peering to centrallog1001 in eqiad [homer/public] - https://gerrit.wikimedia.org/r/900422 (https://phabricator.wikimedia.org/T328803) [18:52:27] (CR) Cathal Mooney: [C: +2] Remove BGP peering to centrallog1001 in eqiad [homer/public] - https://gerrit.wikimedia.org/r/900422 (https://phabricator.wikimedia.org/T328803) (owner: Cathal Mooney) [18:53:04] (Merged) jenkins-bot: Remove BGP peering to centrallog1001 in eqiad [homer/public] - https://gerrit.wikimedia.org/r/900422 (https://phabricator.wikimedia.org/T328803) (owner: Cathal Mooney) [18:58:35] SRE, Infrastructure-Foundations, netops, observability, Patch-For-Review: SNMP Network Checks throw exception when device is unreachable - https://phabricator.wikimedia.org/T332080 (cmooney) Open→Resolved Closing this one, all our checks now deal with the scenario gracefully. The exa... [18:59:20] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:00:23] Hi :wave [19:00:44] Can we help sharvani_ [19:00:46] here for the "UTC morning backport and config training" window [19:00:59] jouncebot: now [19:00:59] For the next 0 hour(s) and 59 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T1800) [19:01:05] jouncebot: next [19:01:05] In 0 hour(s) and 58 minute(s): UTC late backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T2000) [19:01:14] sharvani_: you are an hour early [19:02:01] sharvani_: what’s your phab username?
[19:02:19] Sharvaniharan [19:03:13] sharvani_: one moment [19:04:02] I don’t see you as signed up [19:04:32] (03CR) 10Cwhite: [C: 03+2] varnish: ignore levelname field [puppet] - 10https://gerrit.wikimedia.org/r/891391 (https://phabricator.wikimedia.org/T330267) (owner: 10Cwhite) [19:05:09] brennen, thcipriani: are you happy with sharvani_ doing the training? [19:05:46] I am not doing the training, just have a patch to be deployed. [19:05:47] sharvani_: hang around, hopefully the trainers can confirm they are planning something [19:06:03] ah, if not attending the training, then yeah - wait for backport window. [19:06:16] sharvani_: your patch isn’t scheduled [19:06:37] someone will likely be around to backport. and yes, please add patch to schedule. [19:08:43] https://wikitech.wikimedia.org/wiki/Backport_windows#How_to_submit_a_patch_for_backport [19:10:01] Sorry had missed adding my ircnick to the request. Did that! thank you. [19:10:47] 10SRE, 10Observability-Logging, 10Traffic: varnish-frontend-fetcherr sets incorrect level in logstash - https://phabricator.wikimedia.org/T330267 (10colewhite) 05Open→03Resolved a:03colewhite Fixes rolled out for SEVERITY_LABEL and levelName field. [19:15:08] RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:18:56] 10SRE, 10LDAP-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10MoritzMuehlenhoff) @FNavas-foundation Who's your manager? They need to sign off on this task. 
@Ottomata This needs approval for analytics-privatedata-users [19:19:40] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:13] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10MoritzMuehlenhoff) [19:27:54] !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@a587106]: (no justification provided) [19:28:06] !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@a587106]: (no justification provided) (duration: 00m 12s) [19:31:56] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:09] (03PS1) 10Jforrester: Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" [vendor] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900427 (https://phabricator.wikimedia.org/T321160) [19:32:31] (03PS1) 10Jforrester: Revert "build: Remove pinning of indirect lcobucci/jwt dependency" [extensions/OAuth] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900144 (https://phabricator.wikimedia.org/T321160) [19:35:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Htriedman) just bumping this! [19:35:26] (03PS2) 10Jforrester: Revert "build: Remove pinning of indirect lcobucci/jwt dependency" [extensions/OAuth] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900144 (https://phabricator.wikimedia.org/T321160) [19:41:12] 10SRE, 10Traffic-Icebox: cache_upload varnish-fe exhausting transient memory - https://phabricator.wikimedia.org/T249809 (10BCornwall) 05Open→03Stalled @Vgutierrez, @BBlack: What's the status of this? 
It's three years old but I took a look at the transient memory and there are only a few instances where it... [19:44:38] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10BCornwall) 05Open→03Resolved [19:49:07] (03PS1) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) [19:49:56] (03PS2) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) [19:50:54] note for backport deployers: currently working on a train blocker, i may need you to hold off. [19:51:00] (03PS1) 10Cathal Mooney: Enable OSPF check by default for l3 switch mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/900431 (https://phabricator.wikimedia.org/T315053) [19:51:08] thank you for the update [19:52:44] (03PS3) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306) [19:54:27] i'm just waiting at present on https://gerrit.wikimedia.org/r/c/mediawiki/vendor/+/900427/ [19:56:10] (03CR) 10Brennen Bearnes: [C: 03+2] Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" [vendor] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900427 (https://phabricator.wikimedia.org/T321160) (owner: 10Jforrester) [19:56:19] (heh, would help to +2) [19:59:01] :p [20:00:05] brennen and TheresNoTime: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230316T2000). [20:00:05] sharvani_: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] * TheresNoTime can deploy, holding off deploy per brennen [20:01:56] thanks TheresNoTime, sharvani_. i'll ping once this is safely out, you can do backports, and then we'll roll the train forward. [20:02:09] sure thing :) [20:02:56] Sure.. :) [20:10:07] (03PS1) 10Andrew Bogott: Trove: adjust volume formats a whole lot [puppet] - 10https://gerrit.wikimedia.org/r/900436 [20:10:45] (03Merged) 10jenkins-bot: Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" [vendor] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900427 (https://phabricator.wikimedia.org/T321160) (owner: 10Jforrester) [20:10:52] (03PS2) 10Andrew Bogott: Trove: adjust volume format timeouts again [puppet] - 10https://gerrit.wikimedia.org/r/900436 [20:12:43] !log brennen@deploy2002 Started scap: Backport for [[gerrit:900427|Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" (T321160)]] [20:12:49] T321160: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty - https://phabricator.wikimedia.org/T321160 [20:13:15] (03CR) 10Andrew Bogott: [C: 03+2] Trove: adjust volume format timeouts again [puppet] - 10https://gerrit.wikimedia.org/r/900436 (owner: 10Andrew Bogott) [20:14:38] !log brennen@deploy2002 brennen and jforrester: Backport for [[gerrit:900427|Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" (T321160)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:18:59] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) @MoritzMuehlenhoff my manager is @RBrounley_WMF (he is off on holiday at the moment). @Aklapper is there someone who can help me untangle this? per... 
[20:21:49] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:900427|Revert "Upgrading lcobucci/jwt (4.1.5 => 4.3.0)" (T321160)]] (duration: 09m 06s) [20:21:55] T321160: Lcobucci\JWT\Signer\InvalidKeyProvided: Key cannot be empty - https://phabricator.wikimedia.org/T321160 [20:22:45] ok, vendor patch deployed; nothing seems to be exploding, at least. [20:23:06] TheresNoTime: give me 2 min to see if bug is fixed, then it's all yours. [20:23:11] ack :) [20:25:06] TheresNoTime: all yours [20:25:12] great [20:25:18] sharvani_: you ready? [20:25:21] (03PS1) 10JHathaway: jaeger: upgrade chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/900438 (https://phabricator.wikimedia.org/T320554) [20:25:26] ready! [20:25:30] thank you :) [20:25:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900399 (owner: 10Sharvaniharan) [20:26:18] (03Merged) 10jenkins-bot: Remove sampling from breadCrumbs schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900399 (owner: 10Sharvaniharan) [20:26:40] sharvani_: am I correct in saying this is something which won't need testing on the mwdebug servers? [20:26:42] !log samtar@deploy2002 Started scap: Backport for [[gerrit:900399|Remove sampling from breadCrumbs schema]] [20:26:59] I can test it... is it deployed on 2002? [20:27:16] not yet, I'll let you know :) [20:27:54] ok ready to test whenever it is done :) [20:28:30] !log samtar@deploy2002 samtar and sharvaniharan: Backport for [[gerrit:900399|Remove sampling from breadCrumbs schema]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [20:28:42] sharvani_: that's live on the mwdebug servers for testing [20:28:53] tested and looks good! [20:28:59] great, syncing now [20:29:09] thank you for deploying for me! 
[20:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:35:01] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:900399|Remove sampling from breadCrumbs schema]] (duration: 08m 18s) [20:35:11] and that's now live sharvani_ :) [20:35:28] Perfect! thank you! [20:35:58] !log close UTC late backport window [20:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:49] !log 1.40.0-wmf.27 train (T330205): blockers hopefully resolved, rolling to all wikis [20:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:54] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [20:37:52] (03PS1) 10JHathaway: jaeger: Add network policy support [deployment-charts] - 10https://gerrit.wikimedia.org/r/900443 (https://phabricator.wikimedia.org/T320554) [20:38:49] (03CR) 10JHathaway: [C: 03+2] jaeger: upgrade chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/900438 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [20:39:40] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900444 (https://phabricator.wikimedia.org/T330205) [20:39:42] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900444 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot) [20:39:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:40:26] (03Merged) 10jenkins-bot: all wikis 
to 1.40.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900444 (https://phabricator.wikimedia.org/T330205) (owner: 10TrainBranchBot) [20:47:49] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.27 refs T330205 [20:47:55] T330205: 1.40.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T330205 [20:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:49:21] (03CR) 10JHathaway: [C: 03+2] jaeger: Add network policy support [deployment-charts] - 10https://gerrit.wikimedia.org/r/900443 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [20:52:10] (03CR) 10Ottomata: [C: 03+1] "We can probably do the same for analytics-research-admins" [puppet] - 10https://gerrit.wikimedia.org/r/899653 (https://phabricator.wikimedia.org/T331647) (owner: 10Muehlenhoff) [20:53:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:57:34] (03CR) 10Ayounsi: [C: 03+1] Enable OSPF check by default for l3 switch mgmt interfaces [puppet] - 10https://gerrit.wikimedia.org/r/900431 (https://phabricator.wikimedia.org/T315053) (owner: 10Cathal Mooney) [21:04:20] (03PS1) 10Ayounsi: Add cloudsw1-b1-codfw to Rancid [puppet] - 10https://gerrit.wikimedia.org/r/900448 (https://phabricator.wikimedia.org/T327919) [21:09:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:14:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:17:11] (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/output/900313/40180/" [puppet] - 10https://gerrit.wikimedia.org/r/900313 (https://phabricator.wikimedia.org/T329678) (owner: 10Jaime Nuche) [21:19:04] (03CR) 10Dzahn: [C: 04-1] "The production deployment servers nowadays use role::deployment_server::kubernetes, but role::deployment_server is also still around, prob" [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [21:20:51] (03CR) 10Dzahn: [C: 04-1] "deployment_server::kubernetes (the one applied on deploy1002/2002) already has include profile::docker::engine but not the prune part. 
" [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [21:21:39] (03CR) 10Ahmon Dancy: [C: 03+1] deployment_server: ensure Docker is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [21:21:43] (03CR) 10Dzahn: [C: 03+1] Adds php and apache logs for doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [21:34:55] (03CR) 10Muehlenhoff: [C: 03+2] Add approvers for analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/899653 (https://phabricator.wikimedia.org/T331647) (owner: 10Muehlenhoff) [21:44:18] (03PS1) 10JHathaway: aux: add network policy for jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/900453 (https://phabricator.wikimedia.org/T320554) [21:53:21] (03CR) 10JHathaway: [C: 03+2] aux: add network policy for jaeger [deployment-charts] - 10https://gerrit.wikimedia.org/r/900453 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [21:54:50] !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [22:02:56] (03PS1) 10JHathaway: aux: bump jaeger version [deployment-charts] - 10https://gerrit.wikimedia.org/r/900457 (https://phabricator.wikimedia.org/T320554) [22:04:59] !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [22:08:33] (03CR) 10JHathaway: [C: 03+2] aux: bump jaeger version [deployment-charts] - 10https://gerrit.wikimedia.org/r/900457 (https://phabricator.wikimedia.org/T320554) (owner: 10JHathaway) [22:09:22] !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [22:09:35] !log jhathaway@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[22:26:23] (03CR) 10Dzahn: [C: 03+1] Add doc host apache/php-fpm logs to kafka [puppet] - 10https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [22:28:06] (03CR) 10Dzahn: [C: 03+1] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney) [22:29:17] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host miscweb2003.codfw.wmnet [22:29:19] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [22:31:34] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM miscweb2003.codfw.wmnet - dzahn@cumin2002" [22:32:40] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM miscweb2003.codfw.wmnet - dzahn@cumin2002" [22:32:40] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:32:40] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache miscweb2003.codfw.wmnet on all recursors [22:32:43] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) miscweb2003.codfw.wmnet on all recursors [22:35:36] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm for new host miscweb1003.eqiad.wmnet [22:35:37] !log dzahn@cumin1001 START - Cookbook sre.dns.netbox [22:37:36] (03CR) 10Dzahn: [C: 04-1] deployment_server: ensure Docker is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [22:38:35] !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM miscweb1003.eqiad.wmnet - dzahn@cumin1001" [22:39:42] !log dzahn@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM miscweb1003.eqiad.wmnet - dzahn@cumin1001" [22:39:42] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:39:43] !log dzahn@cumin1001 START - Cookbook sre.dns.wipe-cache miscweb1003.eqiad.wmnet on all recursors [22:39:46] !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) miscweb1003.eqiad.wmnet on all recursors [22:42:27] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host miscweb2003.codfw.wmnet [22:42:30] (03CR) 10Dzahn: deployment_server: ensure Docker is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche) [22:49:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host miscweb1003.eqiad.wmnet [22:50:46] (03PS1) 10Dzahn: set target quarter for miscweb bullseye upgrade to 2023-1 [puppet] - 10https://gerrit.wikimedia.org/r/900463 (https://phabricator.wikimedia.org/T291916) [22:53:09] (03PS1) 10Dzahn: remove role::webserver_misc_apps from sre module [puppet] - 10https://gerrit.wikimedia.org/r/900464 [22:54:21] (03PS1) 10Dzahn: miscweb: add miscweb1003/2003 to rsync_dst_hosts [puppet] - 10https://gerrit.wikimedia.org/r/900465 (https://phabricator.wikimedia.org/T331896) [22:58:00] (03PS1) 10Dzahn: site: add miscweb1003/miscweb2003 with role insetup::serviceops_collab [puppet] - 10https://gerrit.wikimedia.org/r/900486 (https://phabricator.wikimedia.org/T331896) [22:58:53] (03CR) 10Dzahn: [C: 03+2] site: add miscweb1003/miscweb2003 with role insetup::serviceops_collab [puppet] - 10https://gerrit.wikimedia.org/r/900486 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [23:00:31] 10SRE: Setup PAWS internal experimentally on notebook* nodes - https://phabricator.wikimedia.org/T149543 (10nshahquinn-wmf) 05Open→03Resolved This 
has been done for a long time. See [wikitech:Data Engineering/Systems/Jupyter](https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Jupyter) for details. [23:00:31] !log dzahn@cumin2002 START - Cookbook sre.ganeti.reimage for host miscweb2003.codfw.wmnet with OS bullseye [23:01:35] !log dzahn@cumin1001 START - Cookbook sre.ganeti.reimage for host miscweb1003.eqiad.wmnet with OS bullseye [23:11:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on miscweb1003.eqiad.wmnet with reason: host reimage [23:12:31] (03PS1) 10Jdlrobson: Make messages about editing site code more prominent [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/900467 (https://phabricator.wikimedia.org/T311891) [23:14:50] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on miscweb1003.eqiad.wmnet with reason: host reimage [23:15:29] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on miscweb2003.codfw.wmnet with reason: host reimage [23:18:54] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on miscweb2003.codfw.wmnet with reason: host reimage [23:20:08] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@e6f0142]: bump discolytics env to 0.7.0 [23:20:27] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@e6f0142]: bump discolytics env to 0.7.0 (duration: 00m 19s) [23:28:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host miscweb1003.eqiad.wmnet with OS bullseye [23:31:40] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host miscweb2003.codfw.wmnet with OS bullseye [23:32:56] (03CR) 10Dzahn: [C: 03+2] miscweb: add monitoring for annual.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/898994 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [23:33:21] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:25:00 on lvs3007.esams.wmnet with 
reason: rebooting for kernel updates [23:33:36] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on lvs3007.esams.wmnet with reason: rebooting for kernel updates [23:34:08] (03PS2) 10Dzahn: miscweb: add monitoring for bienvenida.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/898995 (https://phabricator.wikimedia.org/T327976) [23:35:57] (03PS2) 10Dzahn: remove role::webserver_misc_apps from sre module [puppet] - 10https://gerrit.wikimedia.org/r/900464 [23:36:15] (03CR) 10Dzahn: [C: 03+2] miscweb: add monitoring for bienvenida.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/898995 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [23:36:47] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Connect - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:37:11] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:37:11] ^ expected [23:38:23] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:38:49] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 477, down: 1, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:40:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 0:10:00 on lvs6003.drmrs.wmnet with reason: rebooting for kernel updates [23:41:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on lvs6003.drmrs.wmnet with reason: rebooting for kernel updates [23:45:55] PROBLEM - BGP status on asw1-b12-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:46:03] ^ expected [23:46:28] (03Restored) 10Ssingh: icinga: add ssingh to cgi.cfg [puppet] - 
10https://gerrit.wikimedia.org/r/767530 (owner: 10Ssingh) [23:47:44] (03Abandoned) 10Ssingh: icinga: add ssingh to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/767530 (owner: 10Ssingh) [23:47:51] RECOVERY - BGP status on asw1-b12-drmrs.mgmt is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:50:39] (03PS1) 10Dzahn: miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - 10https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) [23:51:02] (03CR) 10CI reject: [V: 04-1] miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - 10https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [23:51:19] (03PS2) 10Dzahn: miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - 10https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) [23:51:41] (03PS1) 10Ssingh: icinga: add ssingh to cgi.cfg [puppet] - 10https://gerrit.wikimedia.org/r/900509 [23:51:43] (03CR) 10CI reject: [V: 04-1] miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - 10https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn) [23:52:33] (03PS3) 10Dzahn: miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - 10https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) [23:54:17] (03CR) 10Dzahn: [C: 03+2] miscweb: adjust monitoring for annual.wikimedia.org to check /2017/ [puppet] - 10https://gerrit.wikimedia.org/r/900508 (https://phabricator.wikimedia.org/T327976) (owner: 10Dzahn)