[06:47:52] there's something troubling ceph, it seems to have network issues between D5 (cloudcephosd1024) and the E4/F4 nodes [06:49:49] there's some network intermittence [06:49:51] https://usercontent.irccloud-cdn.com/file/7DWZl0Hf/image.png [06:59:00] created T371869 [06:59:01] T371869: [ceph,network] Intermittent network packets lost - https://phabricator.wikimedia.org/T371869 [06:59:30] XioNoX: topranks any of you around to help debug? ^ [07:00:39] dcaro: what are the endpoints in that graph? [07:00:53] opening task now [07:01:17] that's between ceph nodes, both internal and external interfaces, between D5 and E4/F4 racks (through cloudswitches) [07:01:49] one specific endpoint is cloudcephosd1024 <-> cloudcephosd1028 [07:02:57] on which networks / IPs? [07:03:37] 10.64.20.20 <-> 10.64.148.5 [07:03:51] it's not failing all the time though [07:04:04] yeah [07:04:08] https://www.irccloud.com/pastebin/OEPMVXK7/ [07:06:08] currently they are working ok [07:06:09] https://www.irccloud.com/pastebin/gu73A2Js/ [07:06:15] they had peaks of >4k [07:07:46] https://www.irccloud.com/pastebin/UfYkRd3g/ [07:09:23] Hmm, the lost pings are to cloudcephmon1001 [07:09:25] (mostly) [07:09:59] from cloudvirts [07:10:17] https://usercontent.irccloud-cdn.com/file/HxygsVEv/image.png [07:11:27] and there's a new alert for the cloudgw, that might be related [07:11:29] https://usercontent.irccloud-cdn.com/file/X43TMDaK/image.png [07:11:37] mtu is set to 1500 on cludcephmon1001 [07:12:25] that was like that already before I think, it's a "control" node, it does not really move data around [07:12:42] though let's double check just in case [07:15:08] pings work to it fine at 1500 byte [07:15:18] if you want to send jumbos it'd need to be bigger [07:16:32] that's ok, the lost pings are not jumbo there [07:19:17] https://usercontent.irccloud-cdn.com/file/U9TxmX7A/image.png [07:19:31] ok so it's nothing to do with jumbos then? [07:19:34] that sums per destination server and splits by size (small being non-jumbo) [07:19:41] I think it's both yes [07:22:29] right now everything in E4 pings from cloudcephosd1024 without problem [07:22:33] network looks ok that I can see [07:24:34] Hmm.... confusing [07:25:16] the times on ceph side are improving (the averages are falling down) [07:28:10] there are small numbers of discards on the c8/d5 switches on ports facing some cloudceph nodes [07:28:21] they are not new however [07:28:49] means amount of data going to those hosts is sometimes above the port rate, but it's very minor, tcp should take care of it [07:30:16] but overall I'm not seeing any issue here [07:30:23] or at least right now can't replicate any problem [07:30:28] I still see some drops towards cloudcephosd1006 [07:30:30] https://usercontent.irccloud-cdn.com/file/m6G3fjjc/image.png [07:30:33] (internal ip) [07:31:21] and towards cloudcephosd1011, from cloudvirts(using the external ip), and from other ceph nodes (using internal ip) [07:31:23] https://usercontent.irccloud-cdn.com/file/skN3VfJg/image.png [07:31:35] https://usercontent.irccloud-cdn.com/file/mDa0SsWm/image.png [07:31:55] --- cloudcephosd1006.eqiad.wmnet ping statistics --- [07:31:55] 100 packets transmitted, 100 received, 0% packet loss, time 673ms [07:31:55] rtt min/avg/max/mdev = 0.097/0.166/0.259/0.039 ms [07:32:01] being 'to the machine' instead of from it makes it a bit harder to catch [07:34:53] btw. 
what those ping are is just a ping themselves every 5 minutes [07:36:03] the script is called 'prometheus-node-pinger' [07:36:16] https://www.irccloud.com/pastebin/4zL8FowH/ [07:36:18] got one [07:36:27] to coludcephmon1002, from cloudvirt1055 [07:37:13] but it's one of ~50 times I've run it [07:41:10] browser died xd [07:41:40] it seems to be going down, both the amount of pings lost, and the ceph reported average time [07:41:42] https://usercontent.irccloud-cdn.com/file/ECVLFfPu/image.png [07:42:14] oh, no, one bump [07:42:16] https://www.irccloud.com/pastebin/sLXSGhfw/ [07:43:34] hmpf... it's like trying to catch raindrops [07:45:23] it spiked up again [07:45:25] https://usercontent.irccloud-cdn.com/file/WhSQr24Z/image.png [07:46:14] big one now [07:46:15] https://www.irccloud.com/pastebin/UgV6bmK3/ [07:46:17] looking [07:46:44] https://www.irccloud.com/pastebin/Ep1LII0m/ [07:47:14] https://www.irccloud.com/pastebin/iYkqiD6X/ [07:47:22] hmpf... [07:47:36] failures should be definitive, either fail, or don't xd [07:49:48] heh yeah [07:51:38] https://www.irccloud.com/pastebin/kK7jHjs6/ [07:51:52] so there's some loss [07:52:58] not all together though [07:53:00] https://www.irccloud.com/pastebin/A8rkr4nG/ [07:54:12] yeah - there are definitely discards on the switches - so we'll see some occasional loss [07:54:34] that's not new though - hasn't changed forever looking at the graphs [07:54:45] https://www.irccloud.com/pastebin/LX5DAIVe/ [07:55:00] 7 packages in a row is a bit too much though no? [07:55:14] certainly not great [07:55:29] that might be what's making the heartbeats get lost, and time out from time to time [07:55:39] too small a time interval to do any meaningful troubleshooting though [07:55:41] though as you say, that should not be new [07:55:50] so why now? [08:00:49] definitely ceph is sensitive to <7s network outages [08:00:49] Aug 06 07:31:22 cloudcephosd1033 ceph-osd[2091]: 2024-08-06T07:31:22.351+0000 7fee78af3700 -1 osd.258 54707104 heartbeat_check: no reply from 10.64.20.65:6824 osd.105 since back 2024-08-06T07:31:21.511191+0000 front 2024-08-06T07:30:58.203540+0000 [08:06:03] there's definitely an increase on packets lost, for the last week of data, you can see one ping lost now and then, but since 6:10UTC today, there's between 10 and 40 every round [08:07:32] https://usercontent.irccloud-cdn.com/file/ic651zKQ/image.png [08:08:45] what was the graphing system for the network switches? [08:10:24] hmm... it's interesting that I see the losses on a destination basis, as in, they fail in bunches depending on the destination [08:11:03] for example, all the osds fail to ping cloudcephosd1030 for a bit, then cloudcephosd1012 and such [08:11:23] but 1030 does not fail to ping the rest during that time either (or I don't see it) [08:12:56] oh, neutron is breaking again [08:13:03] I think that might be related too [08:13:15] Neutron neutron-openvswitch-agent on cloudvirt1045 is down [08:15:09] LibreNMS [08:15:15] https://librenms.wikimedia.org/device/device=184/tab=ports/view=graphs/graph=errors/ [08:15:15] https://librenms.wikimedia.org/device/device=185/tab=ports/view=graphs/graph=errors/ [08:15:15] https://librenms.wikimedia.org/device/device=241/tab=ports/view=graphs/graph=errors/ [08:15:15] https://librenms.wikimedia.org/device/device=242/tab=ports/view=graphs/graph=errors/ [08:15:37] that one thanks! 
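For illustration, the kind of probe being discussed here — a periodic ping per destination, paying attention to how many replies are lost in a row — can be sketched in a few lines of Python. This is not the actual prometheus-node-pinger (its implementation isn't shown in the log); the target hostname and counts below are placeholders.

```python
"""Minimal sketch: measure bursts of consecutive lost pings to one host."""
import subprocess

TARGET = "cloudcephosd1030.eqiad.wmnet"  # placeholder destination
COUNT = 100


def ping_once(host: str) -> bool:
    """Send a single ICMP echo with a 1s timeout; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def main() -> None:
    lost = 0
    current_burst = 0
    worst_burst = 0
    for _ in range(COUNT):
        if ping_once(TARGET):
            current_burst = 0
        else:
            lost += 1
            current_burst += 1
            worst_burst = max(worst_burst, current_burst)
    print(f"{TARGET}: {lost}/{COUNT} lost, worst burst: {worst_burst} in a row")


if __name__ == "__main__":
    main()
```

Tracking the worst burst rather than just the overall loss percentage is what separates the harmless background switch discards from outages long enough to trip Ceph's heartbeat checks, which is the distinction being chased in the discussion above.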
(I always forget xd) [08:15:56] we have stats in prometheus for the non-cloudsw devices now which are useful - we need to upgrade the cloudsw to get that ability for them [08:16:49] nice! [08:17:03] another reason to keep pushing for the upgrade :) [08:21:03] this is new [08:21:04] summary: BGP CRITICAL - AS64605/IPv4: Active - Anycast [08:21:17] from https://alerts.wikimedia.org/?q=team%3Dwmcs [08:21:21] topranks: ^ [08:21:31] it went away now [08:21:48] | bf8f8ae1-c342-4288-b766-93008dd4cfa9 | Open vSwitch agent | cloudvirt1042 | None | XXX | UP | neutron-openvswitch-agent | [08:21:59] hmpf... the neutron agents are flapping on many hosts [08:24:01] that bgp alert was for cloudlb1002 [08:25:09] https://usercontent.irccloud-cdn.com/file/SoLdCdM9/image.png [08:25:13] that one came back too [08:28:35] hmpf.... things are flapping everywhere [08:33:46] hello! reading the backscroll... do you think cloud vps users are impacted? I don't see any service-related alerts [08:34:20] I saw a few errors on VMs failing to resolve domains (failing puppet because of that) [08:34:35] so some impact is there [08:34:56] but it was just temporary, in a couple seconds it went away [08:36:15] ok [08:36:51] there's like flakiness all around [08:37:03] I'm seeing some odd bgp flaps on cloudsw1-d5 [08:37:05] virt.cloudgw.eqiad1.wikimediacloud.org fails to ping for example only from time to time [08:37:24] like 2-3% [08:39:53] topranks: yep, bgp alert triggered again [08:39:54] https://usercontent.irccloud-cdn.com/file/yBOjDhu7/image.png [08:40:26] are you looking into that one? [08:40:33] topranks: ^ [08:42:19] yea [08:42:37] thanks 👍 [08:42:45] I'm gonna shut et-0/0/52 on cloudsw1-d5-eqiad (going to cloudsw1-e4-eqiad:et-0/0/54) [08:43:45] ack, let me know if you want me to do anything specific [08:44:09] nothing just yet [08:44:31] so what I did see was spikes of latency pinging over that link.... not really sure there is an issue with the link [08:44:50] it's almost like cloudsw1-d5, or maybe one of the others, has a busy cpu and is sometimes dropping bgp sessions [08:44:59] but cpu graphs don't reflect that or cli checks on same [08:48:40] could it be a faulty cable? [08:48:50] nah I don't think so [08:49:02] seems bfd is failing bad on cloudsw1-d5 [08:49:07] still happening with that link down [08:49:12] oh, not good [08:49:17] https://www.irccloud.com/pastebin/skMCttUk/ [08:49:46] that means is flapping every 2min or so right? [08:50:44] that would meen intermittent drops between d5 and e4/f4 with up to 21s? [08:50:58] *mean [08:51:33] (that would kind of match what we see I think) [08:54:24] yeah it would cause intermittent changes to routing, and packet loss during those reconvergence events [08:56:19] so... I reduced the frequency of the bfd keepalives on cloudsw1-d5 so it's less of them to process [08:56:19] seems to be more stable [08:57:04] the engine for it on those trident 2's has always been poor, min interval was 1 second anyway [08:57:48] thanks! [08:57:55] what does it mean "trident 2's" ? [08:58:08] oh, the model of the switches? 
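To make the BFD numbers above concrete: a session is declared down after a fixed number of consecutive missed keepalives, so detection time is simply interval × multiplier. A minimal sketch, assuming the common default multiplier of 3 (the actual cloudsw multiplier isn't shown in the log, and the 5s value is only an example interval here):

```python
# Back-of-the-envelope BFD timing for the keepalive change discussed above.
# A multiplier of 3 is a common default and an assumption here.

def bfd_detection_time(interval_s: float, multiplier: int = 3) -> float:
    """Seconds of silence before the BFD session is declared down."""
    return interval_s * multiplier


for interval in (1.0, 5.0):  # example intervals: the 1s minimum and a slower setting
    print(f"tx interval {interval}s -> neighbour declared down after "
          f"{bfd_detection_time(interval)}s without keepalives")
```

That is the trade-off behind slowing the keepalives: the struggling control plane has fewer packets to process per second, at the cost of slower detection of a real failure and therefore longer blackholing when one happens.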
[08:58:34] ack [08:58:55] * dcaro keeping an eye everywhere see how things start improving [08:59:31] yeah - the Juniper QFX5100 [08:59:42] The forwarding ASIC in it is a Brodcom Trident 2 [09:00:07] the switches in e4/f4 have the newer trident3 in them which we've generally had better experience with [09:02:58] I got kicked out of an ssh session with a VM [09:03:10] and ceph had 11 osds down for a split second (enough to trigger a rebalance) [09:03:15] and we got slow ops now xd [09:03:18] but it's restoring now [09:06:31] I've increased those keepalives from every 1 second to every 5 across the board now for consistency [09:06:44] although only cloudsw1-d5 seemed to be having problems keeping up [09:07:03] that one remains stable after the changes [09:07:47] no actually there was a flap 50 seconds ago :( [09:11:05] :/ [09:11:26] yep, things are still unstable [09:13:27] yeah cloudsw1-d5 is sick [09:13:36] it's now dropping the LAG interfaces [09:14:06] https://phabricator.wikimedia.org/P67226 [09:14:18] the physical members aren't dropping - just the logical bundle [09:14:29] which probably means it's having issues with the LACP message processing [09:14:45] oh, that is definitely not good [09:17:09] completely aside, someone has been touching the puppetservers git repos as root? I have fixed already 2 of them due to permissions (files owned by root instead of gitpuppet) [09:18:17] topranks: so what would be our options then? do we have to RMA the switch? [09:18:43] nah a reboot is likely all it needs [09:19:10] interesting, that will bring down the whole D5 rack right? [09:19:33] well a reboot would be the first step anyway [09:19:34] yeah [09:19:56] could be a hw issue requiring RMA of course, we'll know that if it returns after reset [09:21:53] okok, let me check what's there [09:22:03] how long would hopefully a reboot take? [09:23:39] about 15 mins usually [09:25:18] that's too much yep, that means outage [09:25:36] let me see what's in that rack [09:27:16] we can't drain it without taking ceph down, so there will be outage for sure, we should plan that with some care [09:29:23] topranks: were you able to create a task for it? I'll link to ours and create a subtask to start planning/assesing the reboot [09:29:37] sure [09:30:18] I'm thinking that the best time might be tomorrow morning, unless things degrade more [09:30:29] I'm just changing the lacp timers to maybe give us some more stablility [09:30:30] (there's less traffic on toolforge side I think) [09:31:48] +1 to scheduling the reboot for tomorrow morning, unless things get worse [09:33:09] dhinus: do you know what is cloudcontrol1008-dev? https://netbox.wikimedia.org/dcim/racks/39/ [09:33:16] dcaro: re:puppetservers, I saw this was merged last night, maybe related? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055502 [09:34:06] dcaro: never heard of cloudcontrol1008-dev [09:34:08] hmm, it still uses the gitpuppet user though [09:34:18] but maybe andre.w did some tests or something [09:34:35] elukey may have some insight on the puppetserver thing [09:34:51] https://phabricator.wikimedia.org/T342455 [09:35:56] "These servers are going to be part of the eqiad2dev deployment, and should get the -devprefix on them, for example cloudcontrol1008-dev." 
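The LACP timer change being made above follows the same arithmetic. The standard 802.3ad rates are 1s between LACPDUs in 'fast' mode and 30s in 'slow' mode, with the bundle timed out after three missed messages; whether these switches use exactly those defaults is an assumption, the log only records that the periodic mode was changed.

```python
# Same arithmetic for the LACP timer change: standard 802.3ad rates, with the
# bundle torn down after 3 missed LACPDUs. Assumes the switches use these
# defaults; the log only says the periodic mode was switched to 'slow'.
RATES_S = {"fast": 1, "slow": 30}  # seconds between LACPDUs per mode
MISSED_BEFORE_DOWN = 3

for mode, interval in RATES_S.items():
    print(f"{mode}: LAG member timed out after ~{interval * MISSED_BEFORE_DOWN}s "
          "without a processed LACPDU")
```

With the switch CPU intermittently too busy to process keepalives, the ~90-second budget of slow mode rather than ~3 seconds is likely why the inter-switch link stabilises after the change, as noted further down.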
[09:36:23] oh, I see, it was for the openstack on k8s tests [09:36:29] so currently idle [09:37:20] yep, "role(insetup::wmcs)" [09:38:19] I find the `dev` suffixes a bit confusing, and limiting after, we have the cluster names to distinguish them already (so maybe coludcontrol1008-eqiad2 might be more appropriate) [09:38:27] anyhow, thanks :) [09:42:27] dcaro: yep agreed, I don't know if we have a max length for hostnames though [09:48:01] created T371878 for the reboot [09:48:01] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878 [09:48:26] I'm thinking that we might just have to be down for that time, I don't think that ceph will be able to handle it, and with it goes every cloudvps project [09:52:27] topranks: ceph seems more stable yes, and the network agents too (and lost ping) [09:52:29] https://usercontent.irccloud-cdn.com/file/7IGqKItq/image.png [09:53:03] yeah the inter-switch link has been stable now for ~21mins [09:53:15] since I changed the LACP mode from 'fast' to 'slow' keepalive messages [09:53:24] it's the switch still dropping the LAG interfaces? [10:03:01] ceph has fully recovered now yep :) [10:03:36] dcaro: they haven't dropped in 30 mins or so [10:03:37] https://phabricator.wikimedia.org/T371879 [10:03:45] topranks: is there any way to get those logs to trigger alerts? Maybe there's some stat or something? [10:04:07] it would have helped debugging (it was a tricky issue to debug) [10:04:20] hmm, do I have access to the switch at all? [10:04:26] (/me thinks of maybe having a cookbook) [10:04:39] your username isn't there no [10:05:28] you can add yourself in the homer/public repo: [10:05:28] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/config/common.yaml#44 [10:06:27] 👍 [10:06:37] in terms of alerts I can see how the SNMP traps are set up, we may have missed the short flaps when we poll [10:06:54] yep, flaps are tricky [10:10:08] topranks: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1060087 added myself [10:10:20] just in case [10:12:40] ok cool, I'll merge that and push it out in a short while [10:14:28] thanks! :) [10:17:44] only the lag alert left :), time to go for lunch [10:18:01] * dcaro lunch [13:23:32] topranks: I added you to our daily sync, so we can talk a bit about the reboot of the switch [13:23:49] (if you can make it, if not it's ok, we will sync offline) [13:37:15] dcaro: should be ok, thanks [14:35:46] topranks: coming? [14:46:34] dcaro: sorry! [14:46:46] I thought it was at the top of the hour - read my calendar wrong [14:47:02] ran out to do a quick errand only back now [14:47:24] topranks: we're still in the meet [15:21:32] dhinus, cloudcumin1001 is telling me 'requests.exceptions.InvalidSchema: Missing dependencies for SOCKS support' is that me doing something silly? [15:22:18] andrewbogott: no, I think that's probably some recent change in wmcs-cookbooks [15:22:35] dcaro, ^ that related to the change you just linked us to? [15:23:08] I have not merged those yet [15:23:13] but might have been something from before [15:23:29] andrewbogott: do you have the stack trace? [15:23:33] yep! [15:23:51] https://www.irccloud.com/pastebin/uPp4bk9M/ [15:24:35] probably best if someone confirms that happens for them too before we start debugging -- could be something messed up with my env [15:24:47] which cookbook are you using? 
[15:24:56] oh I see it at the top [15:25:20] `Loading socks proxy config from /etc/spicerack/wmcs.yaml` that might be it [15:25:21] I get the same error [15:26:26] I added that file, and it seems to be triggering the proxy feature (that it seems not supported in the requests version installed there :/ ) [15:26:30] let me patch it out [15:27:04] maybe merging your existing patch would actually fix it? [15:27:58] Yeah, when I asked if it was 'related' to dcaro's patch I think I meant 'fixed by' [15:28:39] this should fix it https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060130 [15:29:03] dhinus: it will not, as it still tries to load the proxy, this one will skip it unless there's any proxy-specific confdig [15:29:06] *config [15:29:39] yep makes sense [15:31:20] can I patch stuff directly on the coludcumin? [15:32:02] puppet will reset it I think [15:32:47] is there a process to test stuff there? [15:32:53] test-cookbook works [15:33:02] guys can I hold off in adding the second-nic network config for cloudcephosd1036 ? [15:33:06] like in the cumin hosts, you pass it a patch number [15:33:17] it's connected to the not-so-healthy switch, so I'm figuring the less changes to it the better [15:33:38] and also I assume it's not much help to us to survive the reboot if it's in that rack [15:33:42] topranks: that's ok yes, we were not going to setup that one either until after the reboot [15:33:50] ok cool yep [15:33:59] dhinus: okok [15:36:06] dhinus: nice, it worked fine :) [15:39:13] shall I merge it? [15:43:56] I think it should be on it? (zuul gate) [15:44:20] 17:41:13 pylint: no-member / Module 'gitlab' has no 'Gitlab' member (col 22) [15:44:27] again... on a non-related change [15:44:49] if I add the `pylint: disable=no-member` then it fails saying that the disable it's not needed xd [15:45:02] :( [15:45:20] shoot, I can't drain from cloudcumin [15:45:21] spicerack.icinga.IcingaError: Unable to read command_file configuration in /etc/icinga/icinga.cfg [15:45:41] the gate-and-submit checks are different from the checks that +2'd it before? [15:46:56] nope, just random failure :/ [15:47:11] (as in failing randomly, the failure is the same) [15:48:28] ok cloudcephosd1035, cloudcephosd1037, cloudcephosd1038 and cloudcephosd1039 now all have their second NIC configured on the switch side correctly [15:48:30] I think I have a "fix" for the issue [15:48:35] topranks: thanks! [15:48:38] that was quick :) [15:48:51] I updated netbox for cloudcephosd1036, but didn't push the changes to the faulty switch [15:49:19] ack, can you add a note in the task so we don't forget when we set it up (after the reboot) [15:50:19] sure [15:50:57] If anyone wants to nominate someone to sit on the Toolforge standards committee, edits at https://wikitech.wikimedia.org/wiki/Help_talk:Toolforge/Toolforge_standards_committee#August_2024_committee_nominations are most welcome. I will do the leg work to follow up with folks who are nominated to get them to counter sign or reject the nomination. [15:51:38] thank you bd808! [15:51:41] The only person I know not to bug about this is anticompositenumber. They have an objection to the Volunteer NDA [15:52:07] blancadesal: thanks for moving this! [15:53:35] dhinus: andrewbogott silly fix for the ci issue https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060138/1 [15:53:41] (passed that part of the tests) [15:54:47] that is indeed silly [15:57:08] oops, that was for bd808, thanks for moving this! [16:01:18] dcaro: I can set_maintenance again! 
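The idea behind the fix being discussed — skip the proxy setup entirely unless proxy-specific config is actually present — might look roughly like this. It is a sketch only, not the real patch: the `socks_proxy` key is hypothetical and the real layout of /etc/spicerack/wmcs.yaml isn't shown in the log.

```python
"""Sketch: only wire a SOCKS proxy into the HTTP session when one is configured,
so hosts whose requests install lacks the socks extra never hit that code path."""
import pathlib

import requests
import yaml

CONFIG_PATH = pathlib.Path("/etc/spicerack/wmcs.yaml")


def make_session() -> requests.Session:
    session = requests.Session()
    config = {}
    if CONFIG_PATH.exists():
        config = yaml.safe_load(CONFIG_PATH.read_text()) or {}
    proxy = config.get("socks_proxy")  # hypothetical key, not the real file layout
    if proxy:
        # socks5h:// proxies need the 'socks' extra (PySocks); without it requests
        # raises InvalidSchema: Missing dependencies for SOCKS support
        session.proxies = {"http": proxy, "https": proxy}
    return session
```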
ty [16:02:19] Thanks andrewbogott and dcaro. JJMC89's fatalism about the state of the committee a few weeks ago finally goaded me into trying something. I should have taken action ~4 years ago, but now is better than never. ;) [16:04:40] hmmm... so the only place I can run the ceph cookbooks is on my laptop... [16:04:54] dhinus: ^ do we have a wmcs-cookbooks enabled host that can mess with icinga? [16:05:56] nope, only alertmanager [16:06:25] icinga was harder to enable because it's only via ssh at the moment [16:06:56] while alertmanager has an API that should now be allowed from cloudcumin* [16:07:22] and of course the hope is that icinga will at some point go away :D [16:08:06] okok, so until then, andrewbogott, all the ceph cookbooks have to run on your laptop [16:08:12] (fyi) [16:08:27] hm, I don't think I have a working env at the moment so that will take me a while. [16:10:23] dhinus: even if it requires hacking out the icinga silencing, I really think we need to be able to run cookbooks on a shared/standard host [16:11:10] I can try to hack around that, might not be pretty [16:11:18] andrewbogott: [16:11:29] can we maybe just manually downtime? [16:11:58] cloudcumin does not have an ssh key that can log into the icinga host [16:12:02] the cookbook is the one failing when trying [16:12:08] yes or just live with the alerts [16:12:21] yep, but has to be worked around in the cookbook (that I thought already was :/) [16:12:22] as long as they don't page [16:12:27] https://www.irccloud.com/pastebin/nNXuzVMG/ [16:12:38] right because now it tries to contact icinga [16:12:44] but maybe it fails somewhere else, it was complaining about some icinga config [16:14:07] Sorry, I don't mean you need to hack around it right this minute so I can run the cookbooks, just that having cloud-cumin still not work a year after we set it up means it's time for desperate workarounds :) [16:14:37] we should be able to run the ceph cookbooks somewhere that's not our laptops :) [16:16:58] andrewbogott: taavi fixed *most* of the cookbooks that were not working from cloudcumins, I think only the ceph ones are left because of the icinga issue [16:17:21] I'll open a task to track that some are _still_ not working, if we don't have a task already [16:17:25] I was supposed to fix those [16:17:32] (and I thought I did) [16:18:52] are there correct-ish docs about setting up local wmcs-cookbooks? [16:19:00] * andrewbogott searching, finding lots of fragmented/obsolete things [16:19:45] andrewbogott: there's a script that should do all the work for you [16:19:59] though spicerack fails to install on python >=3.12 [16:20:09] (you'll have to hack around in the venv) [16:20:43] dhinus: what's the best way to know if you are running on a cloudcumin? [16:20:47] just check the hostname? [16:21:06] dcaro: what/where is the script? [16:21:10] let me think. you could maybe check if you're using the proxy? [16:21:23] andrewbogott: in the same repo [16:22:07] https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main/utils/generate_wmcs_config.sh [16:22:28] https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main#installation [16:22:36] ^ that's more or less ok [16:23:18] dcaro: self.spicerack will likely contain some attributes you can use to determine if you're on a cloudcumin [16:23:30] but I'm not sure what's the cleanest way [16:23:57] maybe we should just check capabilities and not where it's running? [16:25:52]

[16:26:44] I would say though that out of cloudcumin, you still want to fail if icinga is not reachable [16:27:14] yes, what I meant was that if cookbook X needs feature Y, it could check if feature Y is available, instead of checking which host it's running on [16:27:25] so e.g. checking if it can ssh to icinga [16:27:46] otherwise, I think checking if the proxy is defined seems a reasonable way to determine if you're running from a lapotp [16:28:02] off topic: there's a replication alert for toolsdb, looking [16:28:09] dcaro: your docs suggest that I run 'wmcs utils/generate_wmcs_config.sh' as though 'wmcs' is something [16:28:22] that might be a typo [16:28:53] 'bash' I assume? [16:29:12] or just run the script directly, I'll fix the docs [16:30:02] dhinus: you'll never get an error if you are checking if feature X should be working by checking if feature X works xd [16:30:16] toolsdb issue: the replica host logged a clean shutdown of the mariadb service, not sure what triggered it [16:30:20] "/opt/wmf-mariadb104/bin/mysqld (initiated by: unknown): Normal shutdown" [16:30:31] I restarted the unit with sudo systemctl start mariadb [16:32:28] "systemd-logind[735]: Power key pressed." [16:32:40] funny being a virtual machine :D but I guess something triggered a VM shutdown [16:32:52] andrewbogott: any idea on what could cause it? [16:33:14] cloudvirt migration maybe? [16:33:30] I live migrated it, that shouldn't have caused a shutdown [16:33:42] or at least I /tried/ to live migrate it, and it asked for confirmation after [16:34:09] Well, I live migrated everything on the switch-affected cloudvirts. Know the ID of that host? [16:34:57] "systemd-logind[735]: Power key pressed with a virtual finger." ;) [16:35:25] :D [16:35:41] andrewbogott: let me find the id [16:36:06] a reboot is enough to stop replication, because mariadb doesn't start automatically on boot [16:37:14] andrewbogott: tools-db-3, with id i-0009a1ea [16:39:06] ok, yes, that's one of the ones I migrated [16:39:23] I do not know why it rebooted, it shouldn't have [16:40:12] andrewbogott: cookbook readme fix https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060143 [16:40:20] well, I see that others also rebooted... [16:40:32] So 'openstack server migrate' must do a different thing again [16:40:37] Sorry for the noise dhinus [16:40:45] it lost the 'live' [16:41:32] andrewbogott: no worries, it was an easy fix :) [16:41:37] https://www.irccloud.com/pastebin/V8VMITR1/ [16:41:42] it has that option [16:41:59] yeah but it's been showing deprecation warnings when I use it... 
[16:42:25] there's two others [16:42:26] https://www.irccloud.com/pastebin/i8cqbYpR/ [16:42:31] not sure what's the different [16:42:34] *difference [16:43:21] Anyway, I guess my scripted migration must've cold-migrated things even though afterwards it asked me to confirm resize (which is typically the behavior after a live migration) [16:43:28] So I guess I must've rebooted a bunch of things :( [16:45:13] dhinus: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060144 that uses the hostname, there's no easy way to check if the proxy should be used as of right now, it does not register itself anywhere [16:46:16] dcaro: good enough for now, we can think of other ways later [16:46:35] I like the idea of the capabilities though, maybe we can add something like `with_icinga` to `wmcs.yaml` and use that from the config instead [16:48:23] hey, I think cookbooks are now working thanks to dcaro's new docs [16:49:18] \o/ [16:50:56] We have some spare cloudvirts, I'm going to pool them and then fully drain affected cloudvirts (hopefully if I use the cookbook it won't randomly reboot things) [16:58:41] dcaro: yep I think adding something to the yaml files would work [17:02:39] * andrewbogott gets some lunch while cloudvirts drain [17:07:34] andrewbogott: cloudcephosd1011 is draining (from my laptop) it will take a while, I'll try to check before I go to bed to see if it finished and let you know if you can do another, I'm stepping away for today, long day.... [17:07:43] * dcaro off [17:08:09] (feel free to page me if needed, ex. ceph goes crazy) [17:12:50] ok! [17:20:07] * dhinus also off [19:51:45] hi, I have an interesting problem. I am looking at adding a ssh key pair to the 'jenkins-deploy' wikitech user [19:52:07] and that is no more possible via the wiki [19:53:26] apparently that is nowadays done via https://idm.wikimedia.org/keymanagement/ but there is no SUL account since that is a bot ;) [19:54:22] ah yeah I remember we got an email about it [19:54:26] I'll file a task [20:03:06] hashar: you do not need a SUL account to use  https://idm.wikimedia.org/keymanagement/. YOu can ignore any prompts it gives you. [20:04:29] But please do file a bug if something in Bitu is accidentally requiring SUL attachment [20:05:32] you can also create an SUL shadow for your bot accounts if you would like. I made 5 new ones last week for bots that write to Wikitech [20:12:44] bd808: thanks ! I have filed https://phabricator.wikimedia.org/T371930 ;) [20:13:28] cause that IDM link sends me to a login page which rejects the credentials [20:13:37] credentials that work fine on wikitech.wikimedia.org ;) [20:13:47] anyway, it is not urgent [20:13:54] thanks! [20:54:05] Hello! [20:54:05] Data Platform Engineering is embarking on some strategy efforts, and to prep, we are trying to learn more about how people use data in and outside of WMF. We are doing a series of user interviews on this topic. [20:54:05] https://docs.google.com/document/d/13VEbydiAxvnDtOKrRK3riFPSQPuI4duk2L2NxUsvrjU/edit#heading=h.iuwqubwdbx3m [20:54:06] We’re looking for folks from WMF who manage and Cloud VPS / ToolForge , as well as a few relevant volunteers who use that infrastructure. [20:54:07] Can you help me find the right people? [20:57:55] hello ottomata! Our team is across the pond now so you'll find more people earlier in the day. Also Vivian is on leave for a while. [20:58:39] If you can get magnus to respond to your email she's a good choice, otherwise I can point you to a few more active volunteers [20:59:50] okay thank you! 
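The `with_icinga` idea floated above could end up looking something like this: the cookbook reads a declared capability from wmcs.yaml and picks the icinga or alertmanager path accordingly, instead of checking the hostname. Everything here is hypothetical — the key names and file layout are only the proposal from the discussion, nothing like this exists in the repo yet.

```python
"""Sketch of a capability flag in wmcs.yaml (hypothetical schema)."""
import pathlib

import yaml

CONFIG_PATH = pathlib.Path("/etc/spicerack/wmcs.yaml")


def has_capability(name: str) -> bool:
    """True when wmcs.yaml declares the given capability."""
    if not CONFIG_PATH.exists():
        return False
    config = yaml.safe_load(CONFIG_PATH.read_text()) or {}
    return bool(config.get("capabilities", {}).get(name))


def silence_alerts() -> None:
    if has_capability("with_icinga"):
        print("downtiming in icinga (ssh-based, needs the key)")
    else:
        print("silencing in alertmanager only (API reachable from cloudcumin)")
```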
which magnus? :) [21:00:22] bryan suggested I send an email to you and Joanna? I'll do that to coordinate, ya? [21:00:44] in the meantime, if you want to add folks here (and give context) feel free to edit: https://docs.google.com/document/d/13VEbydiAxvnDtOKrRK3riFPSQPuI4duk2L2NxUsvrjU/edit#heading=h.f2wfwucumenm [21:07:42] ottomata: magnus is on your list already, only the writer of that list can know which magnus :) [21:08:10] I assumed you meant magnus manske [21:18:01] oh yes! dan put that there. i don't have contact for them...do you? [22:22:16] ottomata: https://en.wikipedia.org/wiki/Special:EmailUser/Magnus_Manske :) [23:17:08] oh ho!