[08:27:27] ah, I see what you meant volans by "a little surprise" :D [08:27:38] thanks! [08:27:51] was about to ping you here :) [08:28:05] the CR volume did the job for you :D [08:28:13] I just sent a chain of patches with what we discussed at the last meeting [08:28:32] * arnaudb has a lot to read [08:28:32] I've designed them to simplify the review, at least I hope [08:30:21] copy mysql.Mysql to mysql_legacy.MysqlClient, use it, remove mysql, rename mysql_legacy to mysql [08:30:36] I have the cookbook's ones too, just fixing the commit messages for the correct depends-on [08:30:53] 😱 omg [08:47:21] Emperor: o/ moving ms-be2083 to UEFI (provision+reimage) [08:54:09] good luck! [09:04:51] arnaudb: thanks for the realtime reviews, I'll probably wait to see if ma.nuel has some comments and then we can plan the merge/release/deploy dance with the related ones for the cookbooks [09:15:07] yep :) wanted to familiarize myself with the shiny new tools [09:15:35] but happy to dance whenever! [09:22:40] what do you think of the general idea/ [09:22:41] ? [09:23:50] it's very well timed, as we'll soon be onboarding a new person that will just be able to jump into a way more lisible codebase ^^ it makes more sense that the 2 classes are merged under a generic naming [09:24:08] readable* [09:24:45] cookbooks will be even simpler, it's a huge win [10:26:58] Emperor: I updated the task (https://phabricator.wikimedia.org/T371400#10295522), but tl;dr is that even with UEFI the /dev/disk/by-path duplication still happens [10:34:56] :( [10:36:16] how's your ruby? 
[11:13:19] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 54.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [11:17:19] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [11:20:05] Emperor: sorry, missed the update, it is very rusty but I can help reviewing if needed! [11:20:48] DYK where puppet puts custom fact files on the host if I want to test one? [11:26:13] ah, /var/lib/puppet/lib/facter [11:41:42] yo, what's happening with pc1017? Anything on-call should do/be aware of? [11:50:27] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087891 ? I think that's a plausible approach (and works on both old & new kit), and avoids having to basically entirely rewrite the whole fact [11:50:53] claime: As far as I know it was being reimaged by arnaudb [11:51:55] ack [11:56:03] arnaudb: ^ please disable notifications there [12:03:56] ack will do [12:04:40] thanks [12:05:27] marostegui: db1206 is not being reimaged, it was pooled [12:05:41] arnaudb: ? [12:05:42] it's a dump thing maybe? [12:05:50] https://phabricator.wikimedia.org/P70957 [12:05:54] my commit rn ↑ [12:06:20] oh sorry [12:06:21] arnaudb: I am talking about claime's comment [12:06:26] I just realized [12:06:34] let me take care of it [12:08:00] <3 thank you [12:22:57] dbctl has uncommitted changes [12:23:05] is that known? 
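[A custom fact of the kind being tested here is just a Ruby file that puppet syncs down to that directory. A minimal sketch of the shape such a file takes; the fact name (example_disks) and its value are made up, not the real swift_disks fact, and a small stand-in for the Facter API keeps the sketch runnable even without the facter gem installed:]

```ruby
# Sketch of a custom fact file of the kind dropped under
# /var/lib/puppet/lib/facter/ on a host for ad-hoc testing.
# Fact name and body are hypothetical, not the real swift_disks fact.
begin
  require 'facter'
rescue LoadError
  # Tiny stand-in mimicking the two Facter calls used below,
  # so the sketch runs without the gem.
  module Facter
    def self.facts
      @facts ||= {}
    end

    def self.add(name, &block)
      resolver = Class.new {
        attr_accessor :code
        def setcode(&blk)
          @code = blk
        end
      }.new
      resolver.instance_eval(&block)
      facts[name] = resolver.code.call
    end

    def self.value(name)
      facts[name]
    end
  end
end

# The fact definition itself: on a host this would live in e.g.
# /var/lib/puppet/lib/facter/example_disks.rb.
Facter.add(:example_disks) do
  setcode do
    # A real fact would inspect the host (e.g. enumerate
    # /dev/disk/by-path); a fixed list keeps the sketch portable.
    %w[sdb sda sdc].sort
  end
end
```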
[12:23:56] it is, it is also fixed [12:24:56] thanks [12:25:14] volans: everything went OK on 1087895, thanks for the hotfix [12:26:47] nice, merging it then [12:36:21] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 68.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [12:36:30] it's recovering, icinga is lagging [12:41:21] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [13:01:47] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 12.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104 [13:02:10] checking [13:02:32] false positive ↑ [13:02:58] https://phabricator.wikimedia.org/P70965 [13:04:47] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104 [13:06:34] arnaudb: what do you mean by false positive? 
[13:07:38] it was a "blip" [13:17:37] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:08] 🤔 [13:22:02] ah ok because I can totally see the lag, just didn't last that long: https://grafana.wikimedia.org/goto/TakobZGHg?orgId=1 [13:23:45] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 10.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [13:23:50] yeah it was a false positive in the sense of the duration ^^ prometheus introduced the "for:" duration for this kind of use case [13:23:58] RESOLVED: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:04] well, sigh. 
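[On the "for:" duration mentioned for short-lived lag blips: a Prometheus alert only fires once its expression has stayed true for the whole window, so a blip shorter than the window never pages. A sketch of such a rule, where the metric name, threshold and window are assumptions, not the production alert:]

```yaml
groups:
  - name: mariadb-replication
    rules:
      - alert: MariaDBSustainedReplicaLag
        # Metric name and threshold are illustrative only.
        expr: mysql_slave_lag_seconds > 10
        for: 10m   # must stay above threshold for 10 minutes to fire
        labels:
          severity: critical
        annotations:
          summary: "Sustained replication lag on {{ $labels.instance }}"
```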
[13:25:20] "Creating sort index SELECT /* WikiExporter::dumpPages" [13:26:03] Amir1: marostegui processlist.log & fullprocesslist.log are available in my homedir on db1206 if needed [13:26:17] (it has recovered) [13:27:37] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:45] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [13:43:44] elukey: if you're happy with that updated fact, LMK and then you can try re-imaging ms-be2083 to see if it now works OK? [13:48:38] arnaudb: Probably report on the ticket about it too [13:48:40] So they are aware [14:02:14] good idea {done} [14:07:47] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 12.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [14:11:47] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 1 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [14:22:37] RESOLVED: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:58] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:42] Emperor: replied with an alternative, but in the end your preference should be the final choice [14:34:25] Hm. [14:44:22] elukey: AFAICT your approach doesn't work at all because it ends up selecting everything regardless [14:45:30] Emperor: I haven't tested the code but if the regex is applied it doesn't select all in theory (the regex is the only one I tested and it does select what's needed) [14:45:31] elukey: I did "swift_disks[:accounts] = devices.select {|d| d.scan('/zerg/') }.sort" and then accounts ends up containing everything [14:46:03] so that regex should match _nothing_ but still everything is being selected [14:47:14] [which I ended up trying because it was matching everything and I couldn't figure out why, so I thought I'd try something that shouldn't match at all] [14:48:19] ah I was reading https://apidock.com/ruby/v2_2_9/String/scan [14:48:51] does it return the entire string if not matched? [14:49:04] it would explain but at the same time I'd be really puzzled [14:49:19] I assumed that scan returns empty/nothing if the regex doesn't apply [14:50:53] yeah tried in irb, it returns [] if not matching [14:53:13] I think the problem is that [] evaluates as True [14:53:49] lol yes [14:53:54] I added != [] and it worked [14:53:59] what the hell.. [14:54:03] https://phabricator.wikimedia.org/P70971 [14:54:39] yep but if you try devices.select {|d| d.scan('/zerg/') != [] }.sort it should work [14:55:03] yeah, then I get an empty list back. [14:55:14] maybe there is a .empty() or something to use, a little cleaner [14:56:04] yes in our case .any? [14:56:09] instead of != [] [14:56:17] and then everything should work [14:58:18] no, devices.select {|d| d.scan('/.*/').any? 
}.sort returns nothing [14:59:27] ah, need to lose the quotes [15:00:44] exactly yes [15:02:06] swift_disks[:accounts] = devices.select {|d| d.scan(/(ata-\d+\.\d+|scsi-\d+:\d+:\d+:\d+)-part4/).any? }.sort [15:02:25] works, although I'm not entirely sure it's a model of readability [15:03:04] it is more specific than .endswith in my opinion, and it tells you exactly what we are looking for [15:03:23] it feels more robust for future changes etc.. [15:07:10] elukey: there you go, then, revised patch pushed to gerrit for your delectation :) [15:08:56] TIL a tiny smidge more about ruby, so thanks :) [15:09:25] Emperor: LGTM! Not sure if there is a way to add specs but we can check later on [15:10:20] at this point I can retry the reimage [15:10:42] elukey: just let me do the merge [15:11:56] elukey: {{done}}, please go ahead :) [15:12:06] ackkk [15:13:17] started [15:13:23] * Emperor crosses digits, goes to make some tea [15:27:28] Emperor: progress but https://phabricator.wikimedia.org/P70973 [15:27:45] does it ring a bell? [15:35:58] Huh, no. 
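[The scan/select confusion puzzled out above boils down to Ruby truthiness: String#scan returns [] when nothing matches, and only nil and false are falsy in Ruby, so a bare scan inside select keeps every element. A small reproduction using the regex from the log, with made-up device names:]

```ruby
# Made-up by-path device names; only the -part4 entries should be kept.
devices = [
  '/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:5:0-part4',
  '/dev/disk/by-path/pci-0000:01:00.0-ata-1.0-part4',
  '/dev/disk/by-path/pci-0000:01:00.0-ata-2.0-part1',
]

# Broken: scan returns [] on no match, but [] is truthy in Ruby
# (only nil and false are falsy), so select keeps everything.
broken = devices.select { |d| d.scan(/zerg/) }.sort

# Fixed: ask whether any match was actually found
# (.any? here; !...empty? or != [] behave the same way).
accounts = devices.select do |d|
  d.scan(/(ata-\d+\.\d+|scsi-\d+:\d+:\d+:\d+)-part4/).any?
end.sort
```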
[15:36:25] (but joe did the original work on the new-style swift backend puppetry) [15:41:37] so the swift_disks fact in https://puppetboard.wikimedia.org/node/ms-be2083.codfw.wmnet looks reasonable [15:41:49] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 94 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [15:43:58] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:25] elukey: I'm guessing that it's the logic in modules/profile/manifests/swift/storage/configure_disks.pp [15:45:03] yep yep but I wanted to check that the new facts didn't trigger any duplicates etc.. [15:45:19] maybe the first time puppet run it could lead to these issues [15:45:19] you see it wants to split the string up based on assumptions about structure that I think aren't true [15:45:41] ahhh right [15:45:47] it assumes the scsi use case sigh [15:46:13] $idx = String(Integer($partition.split(/:/)[-2]) % 2) [15:46:36] ok we need to adjust that [15:46:41] another patch :D [15:46:48] 😿 [15:51:49] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [16:01:44] elukey: pushing something now [16:03:34] argh, I need some puppet horror not ruby horror [16:04:35] there is also the inline ruby option, in case [16:19:34] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087935 is I think correct (I added a comment because the behaviour of .match is ... 
not entirely obvious) [16:23:06] elukey: it's a noop on ms-be1082 which is what we want (I don't think I can usefully test again ms-be2083, but I guess you can reimage once merged?) [16:25:11] Emperor: +1ed, once merged we can just run puppet on ms-be2083 via install-console [16:26:22] elukey: merged, good luck... [16:32:49] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 20.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [16:36:49] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [16:51:39] elukey: puppetboard seems to think a puppet run has completed without 8 [16:51:47] without 🔥 even [16:52:29] the first puppet run failed in some bits, I am running the second [16:52:36] ah no all good [16:52:41] \o/ [16:53:58] so ms-be2083 is ready to get ssh conns, but df -h returns no swift partition [16:54:39] now it shows the partitions, I had to manually sudo mount -a [16:54:53] elukey: that's expected on swift nodes [16:55:11] super, can you check the host and verify that all looks good? [16:55:26] I can also kick off another reimage just to be sure [16:55:58] let me have a quick poke round, then yes I think that'd be sensible to make sure this success didn't depend on some weird state in the middle [16:57:12] if all works, the new workflow should be [16:57:16] 1) run provision to set UEFI [16:57:26] 2) manually set the JBOD disks [16:57:29] 3) reimage [16:57:53] elukey: that node looks good to me, so I think go for a reimage, just to check it's repeatable? 
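[On the configure_disks.pp assumption discussed above: $idx = String(Integer($partition.split(/:/)[-2]) % 2) only makes sense for scsi-style names like scsi-0:0:5:0-part4, where the next-to-last colon field is the SCSI target number; on an ata-style name (pci-0000:01:00.0-ata-2.0-part4) the same split silently lands on a PCI address field instead. A hedged Ruby sketch of handling both schemes with explicit matches; this illustrates the idea, not the code in the actual Gerrit patch:]

```ruby
# Hypothetical helper: derive an alternating 0/1 index from a by-path
# partition name, for both scsi- and ata-style names. Illustrative only.
def device_index(partition)
  if (m = partition.match(/scsi-\d+:\d+:(\d+):\d+-part\d+/))
    Integer(m[1], 10) % 2   # SCSI target number, as the old split assumed
  elsif (m = partition.match(/ata-(\d+)\.\d+-part\d+/))
    Integer(m[1], 10) % 2   # ATA port number plays the same role
  else
    raise ArgumentError, "unrecognised by-path name: #{partition}"
  end
end

device_index('pci-0000:03:00.0-scsi-0:0:5:0-part4')  # => 1 (5 % 2)
device_index('pci-0000:01:00.0-ata-2.0-part4')       # => 0 (2 % 2)
```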
[16:58:40] and if so we'll want to adjust modules/profile/data/profile/installserver/preseed.yaml to point all the new ms and thanos be nodes to use the new setup [16:59:01] kicked off the reimage [16:59:23] +1 about preseed yes, so Papaul/Jenn will be able to set-up all the others [16:59:34] I'll prepare a patch [17:07:33] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087949 when you've a moment [17:08:40] and thanks again for all your help with this! [18:10:07] Emperor: sorry I was in a meeting, I'll review it tomorrow and ping Dcops to fix all the other nodes! [18:10:11] going afk for baby duties :) [18:10:34] the reimage completed successfully btw [18:10:37] \o/ [18:19:14] *\o/* [19:01:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 10.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [19:03:18] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [19:26:16] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 18 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [19:33:18] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 1 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [19:47:37] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:56:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 127.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [20:03:16] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [20:46:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 15.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [21:03:16] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [21:29:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 39.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [21:34:18] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [23:47:37] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed