[08:27:27] ah, I see what you meant volans by "a little surprise" :D [08:27:38] thanks! [08:27:51] was about to ping you here :) [08:28:05] the CR volume did the job for you :D [08:28:13] I just sent a chain of patches with what we discussed at the last meeting [08:28:32] * arnaudb has a lot to read [08:28:32] I've designed them to simplify the review, at least I hope [08:30:21] copy mysql.Mysql to mysql_legacy.MysqlClient, use it, remove mysql, rename mysql_legacy to mysql [08:30:36] I have the cookbook's ones too, just fixing the commit messages for the correct depends-on [08:30:53] 😱 omg [08:47:21] Emperor: o/ moving ms-be2083 to UEFI (provision+reimage) [08:54:09] good luck! [09:04:51] arnaudb: thanks for the realtime reviews, I'll probably wait to see if ma.nuel has some comments and then we can plan the merge/release/deploy dance with the related ones for the cookbooks [09:15:07] yep :) wanted to familiarize myself with the shiny new tools [09:15:35] but happy to dance whenever! [09:22:40] what do you think of the general idea/ [09:22:41] ? [09:23:50] it's very well timed, as we'll soon be onboarding a new person that will just be able to jump into a way more lisible codebase ^^ it makes more sense that the 2 classes are merged under a generic naming [09:24:08] readable* [09:24:45] cookbooks will be even simpler, it's a huge win [10:26:58] Emperor: I updated the task (https://phabricator.wikimedia.org/T371400#10295522), but tl;dr is that even with UEFI the /dev/disk/by-path duplication still happens [10:34:56] :( [10:36:16] how's your ruby? 
[11:13:19] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 54.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [11:17:19] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [11:20:05] Emperor: sorry, missed the update, it is very rusty but I can help reviewing if needed! [11:20:48] DYK where puppet puts custom fact files on the host if I want to test one? [11:26:13] ah, /var/lib/puppet/lib/facter [11:41:42] yo, what's happening with pc1017? Anything on-call should do/be aware of? [11:50:27] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087891 ? I think that's a plausible approach (and works on both old & new kit), and avoids having to basically entirely rewrite the whole fact [11:50:53] claime: As far as I know it was being reimaged by arnaudb [11:51:55] ack [11:56:03] arnaudb: ^ please disable notifications there [12:03:56] ack will do [12:04:40] thanks [12:05:27] marostegui: db1206 is not being reimaged, it was pooled [12:05:41] arnaudb: ? [12:05:42] it's a dump thing maybe? [12:05:50] https://phabricator.wikimedia.org/P70957 [12:05:54] my commit rn ↑ [12:06:20] oh sorry [12:06:21] arnaudb: I am talking about claime's comment [12:06:26] I just realized [12:06:34] let me take care of it [12:08:00] <3 thank you [12:22:57] dbctl has uncommitted changes [12:23:05] is that known? 
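[A custom fact of the kind being tested here is just a Ruby file that puppet syncs down to that directory. A minimal sketch of the shape such a file takes; the fact name (example_disks) and its value are made up, not the real swift_disks fact, and a small stand-in for the Facter API keeps the sketch runnable even without the facter gem installed:]

```ruby
# Sketch of a custom fact file of the kind dropped under
# /var/lib/puppet/lib/facter/ on a host for ad-hoc testing.
# Fact name and body are hypothetical, not the real swift_disks fact.
begin
  require 'facter'
rescue LoadError
  # Tiny stand-in mimicking the two Facter calls used below,
  # so the sketch runs without the gem.
  module Facter
    def self.facts
      @facts ||= {}
    end

    def self.add(name, &block)
      resolver = Class.new {
        attr_accessor :code
        def setcode(&blk)
          @code = blk
        end
      }.new
      resolver.instance_eval(&block)
      facts[name] = resolver.code.call
    end

    def self.value(name)
      facts[name]
    end
  end
end

# The fact definition itself: on a host this would live in e.g.
# /var/lib/puppet/lib/facter/example_disks.rb.
Facter.add(:example_disks) do
  setcode do
    # A real fact would inspect the host (e.g. enumerate
    # /dev/disk/by-path); a fixed list keeps the sketch portable.
    %w[sdb sda sdc].sort
  end
end
```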
[12:23:56] it is, it is also fixed [12:24:56] thanks [12:25:14] volans: everything went OK on 1087895, thanks for the hotfix [12:26:47] nice, merging it then [12:36:21] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 68.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [12:36:30] it's recovering, icinga is lagging [12:41:21] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [13:01:47] PROBLEM - MariaDB sustained replica lag on s4 on db1221 is CRITICAL: 12.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104 [13:02:10] checking [13:02:32] false positive ↑ [13:02:58] https://phabricator.wikimedia.org/P70965 [13:04:47] RECOVERY - MariaDB sustained replica lag on s4 on db1221 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1221&var-port=9104 [13:06:34] arnaudb: what do you mean by false positive? 
[13:07:38] it was a "blip" [13:17:37] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:19:08] 🤔 [13:22:02] ah ok because I can totally see the lag, just didn't last that long: https://grafana.wikimedia.org/goto/TakobZGHg?orgId=1 [13:23:45] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 10.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [13:23:50] yeah it was a false positive in the sense of the duration ^^ prometheus introduced the "for:" duration for this kind of use case [13:23:58] RESOLVED: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:24:04] well, sigh. 
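[On the "for:" duration mentioned for short-lived lag blips: a Prometheus alert only fires once its expression has stayed true for the whole window, so a blip shorter than the window never pages. A sketch of such a rule, where the metric name, threshold and window are assumptions, not the production alert:]

```yaml
groups:
  - name: mariadb-replication
    rules:
      - alert: MariaDBSustainedReplicaLag
        # Metric name and threshold are illustrative only.
        expr: mysql_slave_lag_seconds > 10
        for: 10m   # must stay above threshold for 10 minutes to fire
        labels:
          severity: critical
        annotations:
          summary: "Sustained replication lag on {{ $labels.instance }}"
```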
[13:25:20] "Creating sort index SELECT /* WikiExporter::dumpPages" [13:26:03] Amir1: marostegui processlist.log & fullprocesslist.log are available in my homedir on db1206 if needed [13:26:17] (it has recovered) [13:27:37] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:27:45] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 2.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [13:43:44] elukey: if you're happy with that updated fact, LMK and then you can try re-imaging ms-be2083 to see if it now works OK? [13:48:38] arnaudb: Probably report on the ticket about it too [13:48:40] So they are aware [14:02:14] good idea {done} [14:07:47] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 12.6 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [14:11:47] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 1 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [14:22:37] RESOLVED: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:23:58] FIRING: SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:29:42] Emperor: replied with an alternative, but in the end your preference should be the final choice [14:34:25] Hm. [14:44:22] elukey: AFAICT your approach doesn't work at all because it ends up selecting everything regardless [14:45:30] Emperor: I haven't tested the code but if the regex is applied it doesn't select all in theory (the regex is the only one I tested and it does select what's needed) [14:45:31] elukey: I did "swift_disks[:accounts] = devices.select {|d| d.scan('/zerg/') }.sort" and then accounts ends up containing everything [14:46:03] so that regex should match _nothing_ but still everything is being selected [14:47:14] [which I ended up trying because it was matching everything and I couldn't figure out why, so I thought I'd try something that shouldn't match at all] [14:48:19] ah I was reading https://apidock.com/ruby/v2_2_9/String/scan [14:48:51] does it return the entire string if not matched? [14:49:04] it would explain but at the same time I'd be really puzzled [14:49:19] I assumed that scan returns empty/nothing if the regex doesn't apply [14:50:53] yeah tried in irb, it returns [] if not matching [14:53:13] I think the problem is that [] evaluates as True [14:53:49] lol yes [14:53:54] I added != [] and it worked [14:53:59] what the hell.. [14:54:03] https://phabricator.wikimedia.org/P70971 [14:54:39] yep but if you try devices.select {|d| d.scan('/zerg/') != [] }.sort it should work [14:55:03] yeah, then I get an empty list back. [14:55:14] maybe there is a .empty() or something to use, a little cleaner [14:56:04] yes in our case .any? [14:56:09] instead of != [] [14:56:17] and then everything should work [14:58:18] no, devices.select {|d| d.scan('/.*/').any? 
}.sort returns nothing [14:59:27] ah, need to lose the quotes [15:00:44] exactly yes [15:02:06] swift_disks[:accounts] = devices.select {|d| d.scan(/(ata-\d+\.\d+|scsi-\d+:\d+:\d+:\d+)-part4/).any? }.sort [15:02:25] works, although I'm not entirely sure it's a model of readability [15:03:04] it is more specific than .endswith in my opinion, and it tells you exactly what we are looking for [15:03:23] it feels more robust for future changes etc.. [15:07:10] elukey: there you go, then, revised patch pushed to gerrit for your delectation :) [15:08:56] TIL a tiny smidge more about ruby, so thanks :) [15:09:25] Emperor: LGTM! Not sure if there is a way to add specs but we can check later on [15:10:20] at this point I can retry the reimage [15:10:42] elukey: just let me do the merge [15:11:56] elukey: {{done}}, please go ahead :) [15:12:06] ackkk [15:13:17] started [15:13:23] * Emperor crosses digits, goes to make some tea [15:27:28] Emperor: progress but https://phabricator.wikimedia.org/P70973 [15:27:45] does it ring a bell? [15:35:58] Huh, no. 
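[The scan/select confusion puzzled out above boils down to Ruby truthiness: String#scan returns [] when nothing matches, and only nil and false are falsy in Ruby, so a bare scan inside select keeps every element. A small reproduction using the regex from the log, with made-up device names:]

```ruby
# Made-up by-path device names; only the -part4 entries should be kept.
devices = [
  '/dev/disk/by-path/pci-0000:03:00.0-scsi-0:0:5:0-part4',
  '/dev/disk/by-path/pci-0000:01:00.0-ata-1.0-part4',
  '/dev/disk/by-path/pci-0000:01:00.0-ata-2.0-part1',
]

# Broken: scan returns [] on no match, but [] is truthy in Ruby
# (only nil and false are falsy), so select keeps everything.
broken = devices.select { |d| d.scan(/zerg/) }.sort

# Fixed: ask whether any match was actually found
# (.any? here; !...empty? or != [] behave the same way).
accounts = devices.select do |d|
  d.scan(/(ata-\d+\.\d+|scsi-\d+:\d+:\d+:\d+)-part4/).any?
end.sort
```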
[15:36:25] (but joe did the original work on the new-style swift backend puppetry) [15:41:37] so the swift_disks fact in https://puppetboard.wikimedia.org/node/ms-be2083.codfw.wmnet looks reasonable [15:41:49] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 94 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [15:43:58] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:25] elukey: I'm guessing that it's the logic in modules/profile/manifests/swift/storage/configure_disks.pp [15:45:03] yep yep but I wanted to check that the new facts didn't trigger any duplicates etc.. [15:45:19] maybe the first time puppet run it could lead to these issues [15:45:19] you see it wants to split the string up based on assumptions about structure that I think aren't true [15:45:41] ahhh right [15:45:47] it assumes the scsi use case sigh [15:46:13] $idx = String(Integer($partition.split(/:/)[-2]) % 2) [15:46:36] ok we need to adjust that [15:46:41] another patch :D [15:46:48] 😿 [15:51:49] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [16:01:44] elukey: pushing something now [16:03:34] argh, I need some puppet horror not ruby horror [16:04:35] there is also the inline ruby option, in case [16:19:34] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087935 is I think correct (I added a comment because the behaviour of .match is ... 
not entirely obvious) [16:23:06] elukey: it's a noop on ms-be1082 which is what we want (I don't think I can usefully test again ms-be2083, but I guess you can reimage once merged?) [16:25:11] Emperor: +1ed, once merged we can just run puppet on ms-be2083 via install-console [16:26:22] elukey: merged, good luck... [16:32:49] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 20.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [16:36:49] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [16:51:39] elukey: puppetboard seems to think a puppet run has completed without 8 [16:51:47] without 🔥 even [16:52:29] the first puppet run failed in some bits, I am running the second [16:52:36] ah no all good [16:52:41] \o/ [16:53:58] so ms-be2083 is ready to get ssh conns, but df -h returns no swift partition [16:54:39] now it shows the partitions, I had to manually sudo mount -a [16:54:53] elukey: that's expected on swift nodes [16:55:11] super, can you check the host and verify that all looks good? [16:55:26] I can also kick off another reimage just to be sure [16:55:58] let me have a quick poke round, then yes I think that'd be sensible to make sure this success didn't depend on some weird state in the middle [16:57:12] if all works, the new workflow should be [16:57:16] 1) run provision to set UEFI [16:57:26] 2) manually set the JBOD disks [16:57:29] 3) reimage [16:57:53] elukey: that node looks good to me, so I think go for a reimage, just to check it's repeatable? 
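[On the configure_disks.pp assumption discussed above: $idx = String(Integer($partition.split(/:/)[-2]) % 2) only makes sense for scsi-style names like scsi-0:0:5:0-part4, where the next-to-last colon field is the SCSI target number; on an ata-style name (pci-0000:01:00.0-ata-2.0-part4) the same split silently lands on a PCI address field instead. A hedged Ruby sketch of handling both schemes with explicit matches; this illustrates the idea, not the code in the actual Gerrit patch:]

```ruby
# Hypothetical helper: derive an alternating 0/1 index from a by-path
# partition name, for both scsi- and ata-style names. Illustrative only.
def device_index(partition)
  if (m = partition.match(/scsi-\d+:\d+:(\d+):\d+-part\d+/))
    Integer(m[1], 10) % 2   # SCSI target number, as the old split assumed
  elsif (m = partition.match(/ata-(\d+)\.\d+-part\d+/))
    Integer(m[1], 10) % 2   # ATA port number plays the same role
  else
    raise ArgumentError, "unrecognised by-path name: #{partition}"
  end
end

device_index('pci-0000:03:00.0-scsi-0:0:5:0-part4')  # => 1 (5 % 2)
device_index('pci-0000:01:00.0-ata-2.0-part4')       # => 0 (2 % 2)
```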
[16:58:40] and if so we'll want to adjust modules/profile/data/profile/installserver/preseed.yaml to point all the new ms and thanos be nodes to use the new setup [16:59:01] kicked off the reimage [16:59:23] +1 about preseed yes, so Papaul/Jenn will be able to set-up all the others [16:59:34] I'll prepare a patch [17:07:33] elukey: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1087949 when you've a moment [17:08:40] and thanks again for all your help with this! [18:10:07] Emperor: sorry I was in a meeting, I'll review it tomorrow and ping Dcops to fix all the other nodes! [18:10:11] going afk for baby duties :) [18:10:34] the reimage completed successfully btw [18:10:37] \o/ [18:19:14] *\o/* [19:01:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 10.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [19:03:18] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 3.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [19:26:16] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 18 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [19:33:18] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 1 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [19:47:37] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:56:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 127.8 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [20:03:16] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [20:46:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 15.4 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [21:03:16] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [21:29:18] PROBLEM - MariaDB sustained replica lag on s1 on db1206 is CRITICAL: 39.2 ge 10 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [21:34:18] RECOVERY - MariaDB sustained replica lag on s1 on db1206 is OK: (C)10 ge (W)5 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1206&var-port=9104 [23:47:37] FIRING: [2x] SystemdUnitFailed: pt-heartbeat-wikimedia.service on pc1017:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed