[06:47:52] there's something troubling ceph, it seems to have network issues between D5 (cloudcephosd1024) and the E4/F4 nodes [06:49:49] there's some network intermittence [06:49:51] https://usercontent.irccloud-cdn.com/file/7DWZl0Hf/image.png [06:59:00] created T371869 [06:59:01] T371869: [ceph,network] Intermittent network packets lost - https://phabricator.wikimedia.org/T371869 [06:59:30] XioNoX: topranks any of you around to help debug? ^ [07:00:39] dcaro: what are the endpoints in that graph? [07:00:53] opening task now [07:01:17] that's between ceph nodes, both internal and external interfaces, between D5 and E4/F4 racks (through cloudswitches) [07:01:49] one specific endpoint is cloudcephosd1024 <-> cloudcephosd1028 [07:02:57] on which networks / IPs? [07:03:37] 10.64.20.20 <-> 10.64.148.5 [07:03:51] it's not failing all the time though [07:04:04] yeah [07:04:08] https://www.irccloud.com/pastebin/OEPMVXK7/ [07:06:08] currently they are working ok [07:06:09] https://www.irccloud.com/pastebin/gu73A2Js/ [07:06:15] they had peaks of >4k [07:07:46] https://www.irccloud.com/pastebin/UfYkRd3g/ [07:09:23] Hmm, the lost pings are to cloudcephmon1001 [07:09:25] (mostly) [07:09:59] from cloudvirts [07:10:17] https://usercontent.irccloud-cdn.com/file/HxygsVEv/image.png [07:11:27] and there's a new alert for the cloudgw, that might be related [07:11:29] https://usercontent.irccloud-cdn.com/file/X43TMDaK/image.png [07:11:37] mtu is set to 1500 on cludcephmon1001 [07:12:25] that was like that already before I think, it's a "control" node, it does not really move data around [07:12:42] though let's double check just in case [07:15:08] pings work to it fine at 1500 byte [07:15:18] if you want to send jumbos it'd need to be bigger [07:16:32] that's ok, the lost pings are not jumbo there [07:19:17] https://usercontent.irccloud-cdn.com/file/U9TxmX7A/image.png [07:19:31] ok so it's nothing to do with jumbos then? [07:19:34] that sums per destination server and splits by size (small being non-jumbo) [07:19:41] I think it's both yes [07:22:29] right now everything in E4 pings from cloudcephosd1024 without problem [07:22:33] network looks ok that I can see [07:24:34] Hmm.... confusing [07:25:16] the times on ceph side are improving (the averages are falling down) [07:28:10] there are small numbers of discards on the c8/d5 switches on ports facing some cloudceph nodes [07:28:21] they are not new however [07:28:49] means amount of data going to those hosts is sometimes above the port rate, but it's very minor, tcp should take care of it [07:30:16] but overall I'm not seeing any issue here [07:30:23] or at least right now can't replicate any problem [07:30:28] I still see some drops towards cloudcephosd1006 [07:30:30] https://usercontent.irccloud-cdn.com/file/m6G3fjjc/image.png [07:30:33] (internal ip) [07:31:21] and towards cloudcephosd1011, from cloudvirts(using the external ip), and from other ceph nodes (using internal ip) [07:31:23] https://usercontent.irccloud-cdn.com/file/skN3VfJg/image.png [07:31:35] https://usercontent.irccloud-cdn.com/file/mDa0SsWm/image.png [07:31:55] --- cloudcephosd1006.eqiad.wmnet ping statistics --- [07:31:55] 100 packets transmitted, 100 received, 0% packet loss, time 673ms [07:31:55] rtt min/avg/max/mdev = 0.097/0.166/0.259/0.039 ms [07:32:01] being 'to the machine' instead of from it makes it a bit harder to catch [07:34:53] btw. 
what those ping are is just a ping themselves every 5 minutes [07:36:03] the script is called 'prometheus-node-pinger' [07:36:16] https://www.irccloud.com/pastebin/4zL8FowH/ [07:36:18] got one [07:36:27] to coludcephmon1002, from cloudvirt1055 [07:37:13] but it's one of ~50 times I've run it [07:41:10] browser died xd [07:41:40] it seems to be going down, both the amount of pings lost, and the ceph reported average time [07:41:42] https://usercontent.irccloud-cdn.com/file/ECVLFfPu/image.png [07:42:14] oh, no, one bump [07:42:16] https://www.irccloud.com/pastebin/sLXSGhfw/ [07:43:34] hmpf... it's like trying to catch raindrops [07:45:23] it spiked up again [07:45:25] https://usercontent.irccloud-cdn.com/file/WhSQr24Z/image.png [07:46:14] big one now [07:46:15] https://www.irccloud.com/pastebin/UgV6bmK3/ [07:46:17] looking [07:46:44] https://www.irccloud.com/pastebin/Ep1LII0m/ [07:47:14] https://www.irccloud.com/pastebin/iYkqiD6X/ [07:47:22] hmpf... [07:47:36] failures should be definitive, either fail, or don't xd [07:49:48] heh yeah [07:51:38] https://www.irccloud.com/pastebin/kK7jHjs6/ [07:51:52] so there's some loss [07:52:58] not all together though [07:53:00] https://www.irccloud.com/pastebin/A8rkr4nG/ [07:54:12] yeah - there are definitely discards on the switches - so we'll see some occasional loss [07:54:34] that's not new though - hasn't changed forever looking at the graphs [07:54:45] https://www.irccloud.com/pastebin/LX5DAIVe/ [07:55:00] 7 packages in a row is a bit too much though no? [07:55:14] certainly not great [07:55:29] that might be what's making the heartbeats get lost, and time out from time to time [07:55:39] too small a time interval to do any meaningful troubleshooting though [07:55:41] though as you say, that should not be new [07:55:50] so why now? [08:00:49] definitely ceph is sensitive to <7s network outages [08:00:49] Aug 06 07:31:22 cloudcephosd1033 ceph-osd[2091]: 2024-08-06T07:31:22.351+0000 7fee78af3700 -1 osd.258 54707104 heartbeat_check: no reply from 10.64.20.65:6824 osd.105 since back 2024-08-06T07:31:21.511191+0000 front 2024-08-06T07:30:58.203540+0000 [08:06:03] there's definitely an increase on packets lost, for the last week of data, you can see one ping lost now and then, but since 6:10UTC today, there's between 10 and 40 every round [08:07:32] https://usercontent.irccloud-cdn.com/file/ic651zKQ/image.png [08:08:45] what was the graphing system for the network switches? [08:10:24] hmm... it's interesting that I see the losses on a destination basis, as in, they fail in bunches depending on the destination [08:11:03] for example, all the osds fail to ping cloudcephosd1030 for a bit, then cloudcephosd1012 and such [08:11:23] but 1030 does not fail to ping the rest during that time either (or I don't see it) [08:12:56] oh, neutron is breaking again [08:13:03] I think that might be related too [08:13:15] Neutron neutron-openvswitch-agent on cloudvirt1045 is down [08:15:09] LibreNMS [08:15:15] https://librenms.wikimedia.org/device/device=184/tab=ports/view=graphs/graph=errors/ [08:15:15] https://librenms.wikimedia.org/device/device=185/tab=ports/view=graphs/graph=errors/ [08:15:15] https://librenms.wikimedia.org/device/device=241/tab=ports/view=graphs/graph=errors/ [08:15:15] https://librenms.wikimedia.org/device/device=242/tab=ports/view=graphs/graph=errors/ [08:15:37] that one thanks! 
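For illustration, the kind of probe being discussed here — a periodic ping per destination, paying attention to how many replies are lost in a row — can be sketched in a few lines of Python. This is not the actual prometheus-node-pinger (its implementation isn't shown in the log); the target hostname and counts below are placeholders.

```python
"""Minimal sketch: measure bursts of consecutive lost pings to one host."""
import subprocess

TARGET = "cloudcephosd1030.eqiad.wmnet"  # placeholder destination
COUNT = 100


def ping_once(host: str) -> bool:
    """Send a single ICMP echo with a 1s timeout; True if a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def main() -> None:
    lost = 0
    current_burst = 0
    worst_burst = 0
    for _ in range(COUNT):
        if ping_once(TARGET):
            current_burst = 0
        else:
            lost += 1
            current_burst += 1
            worst_burst = max(worst_burst, current_burst)
    print(f"{TARGET}: {lost}/{COUNT} lost, worst burst: {worst_burst} in a row")


if __name__ == "__main__":
    main()
```

Tracking the worst burst rather than just the overall loss percentage is what separates the harmless background switch discards from outages long enough to trip Ceph's heartbeat checks, which is the distinction being chased in the discussion above.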
(I always forget xd) [08:15:56] we have stats in prometheus for the non-cloudsw devices now which are useful - we need to upgrade the cloudsw to get that ability for them [08:16:49] nice! [08:17:03] another reason to keep pushing for the upgrade :) [08:21:03] this is new [08:21:04] summary: BGP CRITICAL - AS64605/IPv4: Active - Anycast [08:21:17] from https://alerts.wikimedia.org/?q=team%3Dwmcs [08:21:21] topranks: ^ [08:21:31] it went away now [08:21:48] | bf8f8ae1-c342-4288-b766-93008dd4cfa9 | Open vSwitch agent | cloudvirt1042 | None | XXX | UP | neutron-openvswitch-agent | [08:21:59] hmpf... the neutron agents are flapping on many hosts [08:24:01] that bgp alert was for cloudlb1002 [08:25:09] https://usercontent.irccloud-cdn.com/file/SoLdCdM9/image.png [08:25:13] that one came back too [08:28:35] hmpf.... things are flapping everywhere [08:33:46] hello! reading the backscroll... do you think cloud vps users are impacted? I don't see any service-related alerts [08:34:20] I saw a few errors on VMs failing to resolve domains (failing puppet because of that) [08:34:35] so some impact is there [08:34:56] but it was just temporary, in a couple seconds it went away [08:36:15] ok [08:36:51] there's like flakiness all around [08:37:03] I'm seeing some odd bgp flaps on cloudsw1-d5 [08:37:05] virt.cloudgw.eqiad1.wikimediacloud.org fails to ping for example only from time to time [08:37:24] like 2-3% [08:39:53] topranks: yep, bgp alert triggered again [08:39:54] https://usercontent.irccloud-cdn.com/file/yBOjDhu7/image.png [08:40:26] are you looking into that one? [08:40:33] topranks: ^ [08:42:19] yea [08:42:37] thanks 👍 [08:42:45] I'm gonna shut et-0/0/52 on cloudsw1-d5-eqiad (going to cloudsw1-e4-eqiad:et-0/0/54) [08:43:45] ack, let me know if you want me to do anything specific [08:44:09] nothing just yet [08:44:31] so what I did see was spikes of latency pinging over that link.... not really sure there is an issue with the link [08:44:50] it's almost like cloudsw1-d5, or maybe one of the others, has a busy cpu and is sometimes dropping bgp sessions [08:44:59] but cpu graphs don't reflect that or cli checks on same [08:48:40] could it be a faulty cable? [08:48:50] nah I don't think so [08:49:02] seems bfd is failing bad on cloudsw1-d5 [08:49:07] still happening with that link down [08:49:12] oh, not good [08:49:17] https://www.irccloud.com/pastebin/skMCttUk/ [08:49:46] that means is flapping every 2min or so right? [08:50:44] that would meen intermittent drops between d5 and e4/f4 with up to 21s? [08:50:58] *mean [08:51:33] (that would kind of match what we see I think) [08:54:24] yeah it would cause intermittent changes to routing, and packet loss during those reconvergence events [08:56:19] so... I reduced the frequency of the bfd keepalives on cloudsw1-d5 so it's less of them to process [08:56:19] seems to be more stable [08:57:04] the engine for it on those trident 2's has always been poor, min interval was 1 second anyway [08:57:48] thanks! [08:57:55] what does it mean "trident 2's" ? [08:58:08] oh, the model of the switches? 
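To make the BFD numbers above concrete: a session is declared down after a fixed number of consecutive missed keepalives, so detection time is simply interval × multiplier. A minimal sketch, assuming the common default multiplier of 3 (the actual cloudsw multiplier isn't shown in the log, and the 5s value is only an example interval here):

```python
# Back-of-the-envelope BFD timing for the keepalive change discussed above.
# A multiplier of 3 is a common default and an assumption here.

def bfd_detection_time(interval_s: float, multiplier: int = 3) -> float:
    """Seconds of silence before the BFD session is declared down."""
    return interval_s * multiplier


for interval in (1.0, 5.0):  # example intervals: the 1s minimum and a slower setting
    print(f"tx interval {interval}s -> neighbour declared down after "
          f"{bfd_detection_time(interval)}s without keepalives")
```

That is the trade-off behind slowing the keepalives: the struggling control plane has fewer packets to process per second, at the cost of slower detection of a real failure and therefore longer blackholing when one happens.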
[08:58:34] ack [08:58:55] * dcaro keeping an eye everywhere see how things start improving [08:59:31] yeah - the Juniper QFX5100 [08:59:42] The forwarding ASIC in it is a Brodcom Trident 2 [09:00:07] the switches in e4/f4 have the newer trident3 in them which we've generally had better experience with [09:02:58] I got kicked out of an ssh session with a VM [09:03:10] and ceph had 11 osds down for a split second (enough to trigger a rebalance) [09:03:15] and we got slow ops now xd [09:03:18] but it's restoring now [09:06:31] I've increased those keepalives from every 1 second to every 5 across the board now for consistency [09:06:44] although only cloudsw1-d5 seemed to be having problems keeping up [09:07:03] that one remains stable after the changes [09:07:47] no actually there was a flap 50 seconds ago :( [09:11:05] :/ [09:11:26] yep, things are still unstable [09:13:27] yeah cloudsw1-d5 is sick [09:13:36] it's now dropping the LAG interfaces [09:14:06] https://phabricator.wikimedia.org/P67226 [09:14:18] the physical members aren't dropping - just the logical bundle [09:14:29] which probably means it's having issues with the LACP message processing [09:14:45] oh, that is definitely not good [09:17:09] completely aside, someone has been touching the puppetservers git repos as root? I have fixed already 2 of them due to permissions (files owned by root instead of gitpuppet) [09:18:17] topranks: so what would be our options then? do we have to RMA the switch? [09:18:43] nah a reboot is likely all it needs [09:19:10] interesting, that will bring down the whole D5 rack right? [09:19:33] well a reboot would be the first step anyway [09:19:34] yeah [09:19:56] could be a hw issue requiring RMA of course, we'll know that if it returns after reset [09:21:53] okok, let me check what's there [09:22:03] how long would hopefully a reboot take? [09:23:39] about 15 mins usually [09:25:18] that's too much yep, that means outage [09:25:36] let me see what's in that rack [09:27:16] we can't drain it without taking ceph down, so there will be outage for sure, we should plan that with some care [09:29:23] topranks: were you able to create a task for it? I'll link to ours and create a subtask to start planning/assesing the reboot [09:29:37] sure [09:30:18] I'm thinking that the best time might be tomorrow morning, unless things degrade more [09:30:29] I'm just changing the lacp timers to maybe give us some more stablility [09:30:30] (there's less traffic on toolforge side I think) [09:31:48] +1 to scheduling the reboot for tomorrow morning, unless things get worse [09:33:09] dhinus: do you know what is cloudcontrol1008-dev? https://netbox.wikimedia.org/dcim/racks/39/ [09:33:16] dcaro: re:puppetservers, I saw this was merged last night, maybe related? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1055502 [09:34:06] dcaro: never heard of cloudcontrol1008-dev [09:34:08] hmm, it still uses the gitpuppet user though [09:34:18] but maybe andre.w did some tests or something [09:34:35] elukey may have some insight on the puppetserver thing [09:34:51] https://phabricator.wikimedia.org/T342455 [09:35:56] "These servers are going to be part of the eqiad2dev deployment, and should get the -devprefix on them, for example cloudcontrol1008-dev." 
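The LACP timer change being made above follows the same arithmetic. The standard 802.3ad rates are 1s between LACPDUs in 'fast' mode and 30s in 'slow' mode, with the bundle timed out after three missed messages; whether these switches use exactly those defaults is an assumption, the log only records that the periodic mode was changed.

```python
# Same arithmetic for the LACP timer change: standard 802.3ad rates, with the
# bundle torn down after 3 missed LACPDUs. Assumes the switches use these
# defaults; the log only says the periodic mode was switched to 'slow'.
RATES_S = {"fast": 1, "slow": 30}  # seconds between LACPDUs per mode
MISSED_BEFORE_DOWN = 3

for mode, interval in RATES_S.items():
    print(f"{mode}: LAG member timed out after ~{interval * MISSED_BEFORE_DOWN}s "
          "without a processed LACPDU")
```

With the switch CPU intermittently too busy to process keepalives, the ~90-second budget of slow mode rather than ~3 seconds is likely why the inter-switch link stabilises after the change, as noted further down.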
[09:36:23] oh, I see, it was for the openstack on k8s tests [09:36:29] so currently idle [09:37:20] yep, "role(insetup::wmcs)" [09:38:19] I find the `dev` suffixes a bit confusing, and limiting after, we have the cluster names to distinguish them already (so maybe coludcontrol1008-eqiad2 might be more appropriate) [09:38:27] anyhow, thanks :) [09:42:27] dcaro: yep agreed, I don't know if we have a max length for hostnames though [09:48:01] created T371878 for the reboot [09:48:01] T371878: [network,D5] reboot cloudsw-d5 - https://phabricator.wikimedia.org/T371878 [09:48:26] I'm thinking that we might just have to be down for that time, I don't think that ceph will be able to handle it, and with it goes every cloudvps project [09:52:27] topranks: ceph seems more stable yes, and the network agents too (and lost ping) [09:52:29] https://usercontent.irccloud-cdn.com/file/7IGqKItq/image.png [09:53:03] yeah the inter-switch link has been stable now for ~21mins [09:53:15] since I changed the LACP mode from 'fast' to 'slow' keepalive messages [09:53:24] it's the switch still dropping the LAG interfaces? [10:03:01] ceph has fully recovered now yep :) [10:03:36] dcaro: they haven't dropped in 30 mins or so [10:03:37] https://phabricator.wikimedia.org/T371879 [10:03:45] topranks: is there any way to get those logs to trigger alerts? Maybe there's some stat or something? [10:04:07] it would have helped debugging (it was a tricky issue to debug) [10:04:20] hmm, do I have access to the switch at all? [10:04:26] (/me thinks of maybe having a cookbook) [10:04:39] your username isn't there no [10:05:28] you can add yourself in the homer/public repo: [10:05:28] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/refs/heads/master/config/common.yaml#44 [10:06:27] 👍 [10:06:37] in terms of alerts I can see how the SNMP traps are set up, we may have missed the short flaps when we poll [10:06:54] yep, flaps are tricky [10:10:08] topranks: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/1060087 added myself [10:10:20] just in case [10:12:40] ok cool, I'll merge that and push it out in a short while [10:14:28] thanks! :) [10:17:44] only the lag alert left :), time to go for lunch [10:18:01] * dcaro lunch [13:23:32] topranks: I added you to our daily sync, so we can talk a bit about the reboot of the switch [13:23:49] (if you can make it, if not it's ok, we will sync offline) [13:37:15] dcaro: should be ok, thanks [14:35:46] topranks: coming? [14:46:34] dcaro: sorry! [14:46:46] I thought it was at the top of the hour - read my calendar wrong [14:47:02] ran out to do a quick errand only back now [14:47:24] topranks: we're still in the meet [15:21:32] dhinus, cloudcumin1001 is telling me 'requests.exceptions.InvalidSchema: Missing dependencies for SOCKS support' is that me doing something silly? [15:22:18] andrewbogott: no, I think that's probably some recent change in wmcs-cookbooks [15:22:35] dcaro, ^ that related to the change you just linked us to? [15:23:08] I have not merged those yet [15:23:13] but might have been something from before [15:23:29] andrewbogott: do you have the stack trace? [15:23:33] yep! [15:23:51] https://www.irccloud.com/pastebin/uPp4bk9M/ [15:24:35] probably best if someone confirms that happens for them too before we start debugging -- could be something messed up with my env [15:24:47] which cookbook are you using? 
[15:24:56] oh I see it at the top [15:25:20] `Loading socks proxy config from /etc/spicerack/wmcs.yaml` that might be it [15:25:21] I get the same error [15:26:26] I added that file, and it seems to be triggering the proxy feature (that it seems not supported in the requests version installed there :/ ) [15:26:30] let me patch it out [15:27:04] maybe merging your existing patch would actually fix it? [15:27:58] Yeah, when I asked if it was 'related' to dcaro's patch I think I meant 'fixed by' [15:28:39] this should fix it https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060130 [15:29:03] dhinus: it will not, as it still tries to load the proxy, this one will skip it unless there's any proxy-specific confdig [15:29:06] *config [15:29:39] yep makes sense [15:31:20] can I patch stuff directly on the coludcumin? [15:32:02] puppet will reset it I think [15:32:47] is there a process to test stuff there? [15:32:53] test-cookbook works [15:33:02] guys can I hold off in adding the second-nic network config for cloudcephosd1036 ? [15:33:06] like in the cumin hosts, you pass it a patch number [15:33:17] it's connected to the not-so-healthy switch, so I'm figuring the less changes to it the better [15:33:38] and also I assume it's not much help to us to survive the reboot if it's in that rack [15:33:42] topranks: that's ok yes, we were not going to setup that one either until after the reboot [15:33:50] ok cool yep [15:33:59] dhinus: okok [15:36:06] dhinus: nice, it worked fine :) [15:39:13] shall I merge it? [15:43:56] I think it should be on it? (zuul gate) [15:44:20] 17:41:13 pylint: no-member / Module 'gitlab' has no 'Gitlab' member (col 22) [15:44:27] again... on a non-related change [15:44:49] if I add the `pylint: disable=no-member` then it fails saying that the disable it's not needed xd [15:45:02] :( [15:45:20] shoot, I can't drain from cloudcumin [15:45:21] spicerack.icinga.IcingaError: Unable to read command_file configuration in /etc/icinga/icinga.cfg [15:45:41] the gate-and-submit checks are different from the checks that +2'd it before? [15:46:56] nope, just random failure :/ [15:47:11] (as in failing randomly, the failure is the same) [15:48:28] ok cloudcephosd1035, cloudcephosd1037, cloudcephosd1038 and cloudcephosd1039 now all have their second NIC configured on the switch side correctly [15:48:30] I think I have a "fix" for the issue [15:48:35] topranks: thanks! [15:48:38] that was quick :) [15:48:51] I updated netbox for cloudcephosd1036, but didn't push the changes to the faulty switch [15:49:19] ack, can you add a note in the task so we don't forget when we set it up (after the reboot) [15:50:19] sure [15:50:57] If anyone wants to nominate someone to sit on the Toolforge standards committee, edits at https://wikitech.wikimedia.org/wiki/Help_talk:Toolforge/Toolforge_standards_committee#August_2024_committee_nominations are most welcome. I will do the leg work to follow up with folks who are nominated to get them to counter sign or reject the nomination. [15:51:38] thank you bd808! [15:51:41] The only person I know not to bug about this is anticompositenumber. They have an objection to the Volunteer NDA [15:52:07] blancadesal: thanks for moving this! [15:53:35] dhinus: andrewbogott silly fix for the ci issue https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060138/1 [15:53:41] (passed that part of the tests) [15:54:47] that is indeed silly [15:57:08] oops, that was for bd808, thanks for moving this! [16:01:18] dcaro: I can set_maintenance again! 
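The idea behind the fix being discussed — skip the proxy setup entirely unless proxy-specific config is actually present — might look roughly like this. It is a sketch only, not the real patch: the `socks_proxy` key is hypothetical and the real layout of /etc/spicerack/wmcs.yaml isn't shown in the log.

```python
"""Sketch: only wire a SOCKS proxy into the HTTP session when one is configured,
so hosts whose requests install lacks the socks extra never hit that code path."""
import pathlib

import requests
import yaml

CONFIG_PATH = pathlib.Path("/etc/spicerack/wmcs.yaml")


def make_session() -> requests.Session:
    session = requests.Session()
    config = {}
    if CONFIG_PATH.exists():
        config = yaml.safe_load(CONFIG_PATH.read_text()) or {}
    proxy = config.get("socks_proxy")  # hypothetical key, not the real file layout
    if proxy:
        # socks5h:// proxies need the 'socks' extra (PySocks); without it requests
        # raises InvalidSchema: Missing dependencies for SOCKS support
        session.proxies = {"http": proxy, "https": proxy}
    return session
```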
ty [16:02:19] Thanks andrewbogott and dcaro. JJMC89's fatalism about the state of the committee a few weeks ago finally goaded me into trying something. I should have taken action ~4 years ago, but now is better than never. ;) [16:04:40] hmmm... so the only place I can run the ceph cookbooks is on my laptop... [16:04:54] dhinus: ^ do we have a wmcs-cookbooks enabled host that can mess with icinga? [16:05:56] nope, only alertmanager [16:06:25] icinga was harder to enable because it's only via ssh at the moment [16:06:56] while alertmanager has an API that should now be allowed from cloudcumin* [16:07:22] and of course the hope is that icinga will at some point go away :D [16:08:06] okok, so until then, andrewbogott, all the ceph cookbooks have to run on your laptop [16:08:12] (fyi) [16:08:27] hm, I don't think I have a working env at the moment so that will take me a while. [16:10:23] dhinus: even if it requires hacking out the icinga silencing, I really think we need to be able to run cookbooks on a shared/standard host [16:11:10] I can try to hack around that, might not be pretty [16:11:18] andrewbogott: [16:11:29] can we maybe just manually downtime? [16:11:58] cloudcumin does not have an ssh key that can log into the icinga host [16:12:02] the cookbook is the one failing when trying [16:12:08] yes or just live with the alerts [16:12:21] yep, but has to be worked around in the cookbook (that I thought already was :/) [16:12:22] as long as they don't page [16:12:27] https://www.irccloud.com/pastebin/nNXuzVMG/ [16:12:38] right because now it tries to contact icinga [16:12:44] but maybe it fails somewhere else, it was complaining about some icinga config [16:14:07] Sorry, I don't mean you need to hack around it right this minute so I can run the cookbooks, just that having cloud-cumin still not work a year after we set it up means it's time for desperate workarounds :) [16:14:37] we should be able to run the ceph cookbooks somewhere that's not our laptops :) [16:16:58] andrewbogott: taavi fixed *most* of the cookbooks that were not working from cloudcumins, I think only the ceph ones are left because of the icinga issue [16:17:21] I'll open a task to track that some are _still_ not working, if we don't have a task already [16:17:25] I was supposed to fix those [16:17:32] (and I thought I did) [16:18:52] are there correct-ish docs about setting up local wmcs-cookbooks? [16:19:00] * andrewbogott searching, finding lots of fragmented/obsolete things [16:19:45] andrewbogott: there's a script that should do all the work for you [16:19:59] though spicerack fails to install on python >=3.12 [16:20:09] (you'll have to hack around in the venv) [16:20:43] dhinus: what's the best way to know if you are running on a cloudcumin? [16:20:47] just check the hostname? [16:21:06] dcaro: what/where is the script? [16:21:10] let me think. you could maybe check if you're using the proxy? [16:21:23] andrewbogott: in the same repo [16:22:07] https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main/utils/generate_wmcs_config.sh [16:22:28] https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/wmcs-cookbooks/+/refs/heads/main#installation [16:22:36] ^ that's more or less ok [16:23:18] dcaro: self.spicerack will likely contain some attributes you can use to determine if you're on a cloudcumin [16:23:30] but I'm not sure what's the cleanest way [16:23:57] maybe we should just check capabilities and not where it's running? [16:25:52]

[16:26:44] I would say though that out of cloudcumin, you still want to fail if icinga is not reachable [16:27:14] yes, what I meant was that if cookbook X needs feature Y, it could check if feature Y is available, instead of checking which host it's running on [16:27:25] so e.g. checking if it can ssh to icinga [16:27:46] otherwise, I think checking if the proxy is defined seems a reasonable way to determine if you're running from a lapotp [16:28:02] off topic: there's a replication alert for toolsdb, looking [16:28:09] dcaro: your docs suggest that I run 'wmcs utils/generate_wmcs_config.sh' as though 'wmcs' is something [16:28:22] that might be a typo [16:28:53] 'bash' I assume? [16:29:12] or just run the script directly, I'll fix the docs [16:30:02] dhinus: you'll never get an error if you are checking if feature X should be working by checking if feature X works xd [16:30:16] toolsdb issue: the replica host logged a clean shutdown of the mariadb service, not sure what triggered it [16:30:20] "/opt/wmf-mariadb104/bin/mysqld (initiated by: unknown): Normal shutdown" [16:30:31] I restarted the unit with sudo systemctl start mariadb [16:32:28] "systemd-logind[735]: Power key pressed." [16:32:40] funny being a virtual machine :D but I guess something triggered a VM shutdown [16:32:52] andrewbogott: any idea on what could cause it? [16:33:14] cloudvirt migration maybe? [16:33:30] I live migrated it, that shouldn't have caused a shutdown [16:33:42] or at least I /tried/ to live migrate it, and it asked for confirmation after [16:34:09] Well, I live migrated everything on the switch-affected cloudvirts. Know the ID of that host? [16:34:57] "systemd-logind[735]: Power key pressed with a virtual finger." ;) [16:35:25] :D [16:35:41] andrewbogott: let me find the id [16:36:06] a reboot is enough to stop replication, because mariadb doesn't start automatically on boot [16:37:14] andrewbogott: tools-db-3, with id i-0009a1ea [16:39:06] ok, yes, that's one of the ones I migrated [16:39:23] I do not know why it rebooted, it shouldn't have [16:40:12] andrewbogott: cookbook readme fix https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060143 [16:40:20] well, I see that others also rebooted... [16:40:32] So 'openstack server migrate' must do a different thing again [16:40:37] Sorry for the noise dhinus [16:40:45] it lost the 'live' [16:41:32] andrewbogott: no worries, it was an easy fix :) [16:41:37] https://www.irccloud.com/pastebin/V8VMITR1/ [16:41:42] it has that option [16:41:59] yeah but it's been showing deprecation warnings when I use it... 
[16:42:25] there's two others [16:42:26] https://www.irccloud.com/pastebin/i8cqbYpR/ [16:42:31] not sure what's the different [16:42:34] *difference [16:43:21] Anyway, I guess my scripted migration must've cold-migrated things even though afterwards it asked me to confirm resize (which is typically the behavior after a live migration) [16:43:28] So I guess I must've rebooted a bunch of things :( [16:45:13] dhinus: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1060144 that uses the hostname, there's no easy way to check if the proxy should be used as of right now, it does not register itself anywhere [16:46:16] dcaro: good enough for now, we can think of other ways later [16:46:35] I like the idea of the capabilities though, maybe we can add something like `with_icinga` to `wmcs.yaml` and use that from the config instead [16:48:23] hey, I think cookbooks are now working thanks to dcaro's new docs [16:49:18] \o/ [16:50:56] We have some spare cloudvirts, I'm going to pool them and then fully drain affected cloudvirts (hopefully if I use the cookbook it won't randomly reboot things) [16:58:41] dcaro: yep I think adding something to the yaml files would work [17:02:39] * andrewbogott gets some lunch while cloudvirts drain [17:07:34] andrewbogott: cloudcephosd1011 is draining (from my laptop) it will take a while, I'll try to check before I go to bed to see if it finished and let you know if you can do another, I'm stepping away for today, long day.... [17:07:43] * dcaro off [17:08:09] (feel free to page me if needed, ex. ceph goes crazy) [17:12:50] ok! [17:20:07] * dhinus also off [19:51:45] hi, I have an interesting problem. I am looking at adding a ssh key pair to the 'jenkins-deploy' wikitech user [19:52:07] and that is no more possible via the wiki [19:53:26] apparently that is nowadays done via https://idm.wikimedia.org/keymanagement/ but there is no SUL account since that is a bot ;) [19:54:22] ah yeah I remember we got an email about it [19:54:26] I'll file a task [20:03:06] hashar: you do not need a SUL account to use  https://idm.wikimedia.org/keymanagement/. YOu can ignore any prompts it gives you. [20:04:29] But please do file a bug if something in Bitu is accidentally requiring SUL attachment [20:05:32] you can also create an SUL shadow for your bot accounts if you would like. I made 5 new ones last week for bots that write to Wikitech [20:12:44] bd808: thanks ! I have filed https://phabricator.wikimedia.org/T371930 ;) [20:13:28] cause that IDM link sends me to a login page which rejects the credentials [20:13:37] credentials that work fine on wikitech.wikimedia.org ;) [20:13:47] anyway, it is not urgent [20:13:54] thanks! [20:54:05] Hello! [20:54:05] Data Platform Engineering is embarking on some strategy efforts, and to prep, we are trying to learn more about how people use data in and outside of WMF. We are doing a series of user interviews on this topic. [20:54:05] https://docs.google.com/document/d/13VEbydiAxvnDtOKrRK3riFPSQPuI4duk2L2NxUsvrjU/edit#heading=h.iuwqubwdbx3m [20:54:06] We’re looking for folks from WMF who manage and Cloud VPS / ToolForge , as well as a few relevant volunteers who use that infrastructure. [20:54:07] Can you help me find the right people? [20:57:55] hello ottomata! Our team is across the pond now so you'll find more people earlier in the day. Also Vivian is on leave for a while. [20:58:39] If you can get magnus to respond to your email she's a good choice, otherwise I can point you to a few more active volunteers [20:59:50] okay thank you! 
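The `with_icinga` idea floated above could end up looking something like this: the cookbook reads a declared capability from wmcs.yaml and picks the icinga or alertmanager path accordingly, instead of checking the hostname. Everything here is hypothetical — the key names and file layout are only the proposal from the discussion, nothing like this exists in the repo yet.

```python
"""Sketch of a capability flag in wmcs.yaml (hypothetical schema)."""
import pathlib

import yaml

CONFIG_PATH = pathlib.Path("/etc/spicerack/wmcs.yaml")


def has_capability(name: str) -> bool:
    """True when wmcs.yaml declares the given capability."""
    if not CONFIG_PATH.exists():
        return False
    config = yaml.safe_load(CONFIG_PATH.read_text()) or {}
    return bool(config.get("capabilities", {}).get(name))


def silence_alerts() -> None:
    if has_capability("with_icinga"):
        print("downtiming in icinga (ssh-based, needs the key)")
    else:
        print("silencing in alertmanager only (API reachable from cloudcumin)")
```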
which magnus? :) [21:00:22] bryan suggested I send an email to you and Joanna? I'll do that to coordinate, ya? [21:00:44] in the meantime, if you want to add folks here (and give context) feel free to edit: https://docs.google.com/document/d/13VEbydiAxvnDtOKrRK3riFPSQPuI4duk2L2NxUsvrjU/edit#heading=h.f2wfwucumenm [21:07:42] ottomata: magnus is on your list already, only the writer of that list can know which magnus :) [21:08:10] I assumed you meant magnus manske [21:18:01] oh yes! dan put that there. i don't have contact for them...do you? [22:22:16] ottomata: https://en.wikipedia.org/wiki/Special:EmailUser/Magnus_Manske :) [23:17:08] oh ho!