[01:49:41] vola.ns, something is screwed up with live migration in codfw1dev (the cloudvirtxxxx-dev hosts) which I need to investigate. The cookbook should work fine for the other remaining cloudvirts.
[06:08:51] greetings
[07:10:54] ok I got this out https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/81
[07:11:18] pipelines are getting timeouts when fetching from github, not sure yet if the issue is on our side
[07:11:37] e.g. https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/624235
[07:17:01] that's unfortunately a rather recurring issue
[07:17:14] I tried to fix that with T403028 but that did not work as well as I hoped
[07:17:15] T403028: toolforge tofu-provisioning: Cache terraform-provider-openstack binary somewhere - https://phabricator.wikimedia.org/T403028
[07:18:14] ack, thank you for the context
[07:19:25] do you know, or could you find out, if it is gh timing out kinda on purpose?
[07:21:55] b.d808's theory is that it's some sort of throttling against our NAT egress address. although I am a bit sceptical since I suspect we'd have heard from someone else if they were hitting the same issue
[07:23:17] yeah it is odd, I'm failing to reproduce it so far from a random cloudvps vm
[07:24:09] anyways I can see the rabbit hole from here, not today
[07:42:40] morning
[07:56:12] ok so I'll go ahead and merge https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/81 then the apply will happen automatically (?)
[07:59:49] iirc it will ask for your input before applying (in one of those gitlab manual steps)
[07:59:54] "manually run the last step in the main branch pipeline: tofu apply"
[08:00:01] from the readme
[08:00:22] ah! got it, thank you, I missed that part
[08:00:24] will do now
[08:00:59] https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/pipelines/136220
[08:02:57] thanks andrewb.ogott, I'll keep going through the list
[08:07:00] mmhh things are not going as expected, a new volume is being created, debugging
[08:08:29] oh, I think toolsbeta might have gotten stuck
[08:08:39] of course volumes are also managed by opentofu in toolsbeta
[08:08:43] yes that'd be me
[08:08:51] ack
[08:09:01] let me know if I can help
[08:10:00] ok will do! currently trying to understand what happened
[08:12:03] so the 10gb volume toolsbeta-nfs was disconnected from toolsbeta-nfs-3.toolsbeta.eqiad1.wikimedia.cloud by the cookbook as expected, though it is now in status "reserved"
[08:12:22] and the cookbook/openstack refused to attach it to toolsbeta-nfs-4.toolsbeta.eqiad1.wikimedia.cloud
[08:12:37] BadRequestException: 400: Client Error for url: https://openstack.eqiad1.wikimediacloud.org:28774/v2.1/servers/19c9ecd1-6fb2-4a2d-954a-c1dc6c956034/os-volume_attachments, Invalid input received: Invalid volume: Volume 648504db-18c2-4cee-b731-567dcb4dadf6 status must be available or downloading to reserve, but the current status is reserved. (HTTP 400) (Request-ID:
[08:12:43] req-62a05c8b-e50e-42f0-84de-44c00c8773b9)
[08:13:58] the internet™ suggests openstack volume set --state available
[08:14:17] no idea why it got in status reserved in the first place though
[08:15:24] no idea, looking
[08:15:47] why are there two toolsbeta-nfs volumes?
[08:16:45] taavi: no idea
[08:17:19] the other one (4gb) is ab8ddafd-f8f9-41e3-8a79-d985e962a2ee and description "Ignore this one, not sure where it came from"
[08:17:43] anyways I'll try to force the volume into state available
[08:18:38] morning
[08:19:49] of course the cookbook expects things in a certain state now, so re-running won't work
[08:20:39] ok I'll do things manually instead
[08:22:28] I think I've seen the "status is reserved" before but it's quite rare, not sure what caused it
[08:24:00] * godog nods
[08:24:30] ok the cookbook bits I've done manually, namely reattach the volume to the new host and flip 'profile::wmcs::nfs::standalone::cinder_attached'
[08:24:33] it rings a bell too, I think there was a way to track it on the DB, with reservations and such
[08:24:43] now to get https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/pipelines/136220 to succeed
[08:25:06] I'm open to ideas, currently failing due to timeout while installing the openstack provider per T403028
[08:25:07] T403028: toolforge tofu-provisioning: Cache terraform-provider-openstack binary somewhere - https://phabricator.wikimedia.org/T403028
[08:25:59] hmm I've never seen it fail so consistently 4 times in a row
[08:27:01] usually a single retry would work
[08:27:33] no idea what's the best next action tbh
[08:29:00] Just One More Retry :D https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/jobs/624360
[08:29:09] lol
[08:29:10] from the worker it ran on, curl does not complain, though the jwt token expired
[08:29:27] ok I'll manually apply the last step
[08:30:50] and it failed with the timeout :(
[08:31:08] what a bunch of BS
[08:31:39] I guess retry again, at least for toolsbeta
[08:33:17] ok toolsbeta apply worked
[08:34:56] waiting for dns to propagate then I'll start puppet on the nfs servers again
[08:40:45] ok done, rebooting the nfs workers in toolsbeta
[08:41:49] starting from the bastion actually
[08:42:49] I can login :)
[08:43:15] maybe a bit early xd
[08:44:14] oh ok I couldn't login actually
[08:46:18] `Could not chdir to home directory /home/dcaro: No such file or directory`
[08:46:33] did it reboot?
[08:46:52] https://www.irccloud.com/pastebin/HftrsnyB/
[08:47:06] not yet, I'll stop spamming, let me know when I can test
[08:47:26] yes it did reboot, not sure why homes are not there
[08:47:50] oh okok, looking
[08:48:17] stupid question, is there a way, when the cookbook calls wmcs-drain-hypervisor.py, to tell it to avoid the hosts that will be rebooted next? In most cases it uses the hosts just rebooted because those are empty, but I got a couple of VMs moving to the next hosts to be rebooted
[08:49:06] homes are mounted, maybe they are empty?
[08:49:07] toolsbeta-nfs.svc.toolsbeta.eqiad1.wikimedia.cloud:/srv/toolsbeta/misc/shared/toolsbeta/home on /mnt/nfs/nfs-01-toolsbeta-home type nfs4 (rw,noatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.16.18.135,local_lock=none,addr=172.16.18.119)
[08:49:37] so that path on toolsbeta-nfs-4.toolsbeta.eqiad1.wikimedia.cloud does have data
[08:50:50] dcaro: are you testing stuff on the bastion? I don't want to step on each other's toes
[08:51:11] I was looking on the nfs server
[08:51:26] yep it has data, so maybe not mounted from the right nfs server?
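The manual recovery described above, forcing the stuck volume out of "reserved" and reattaching it to the new server, boils down to two openstack CLI calls. A rough sketch using the volume ID and hostname from the log; admin credentials scoped to the project are assumed, and exact syntax may vary between client versions:

    # force the cinder volume out of the stuck "reserved" state (admin-only flag)
    openstack volume set --state available 648504db-18c2-4cee-b731-567dcb4dadf6
    # reattach it to the replacement NFS server
    openstack server add volume toolsbeta-nfs-4.toolsbeta.eqiad1.wikimedia.cloud 648504db-18c2-4cee-b731-567dcb4dadf6
    # then flip profile::wmcs::nfs::standalone::cinder_attached and run puppet, as mentioned above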
[08:51:57] addr=172.16.18.119, and that's configured in nfs-4
[08:52:17] godog: I'm only running "non modifying" commands, you do any modification ones :)
[08:52:27] lol ok thx dcaro
[08:52:53] https://www.irccloud.com/pastebin/mc6D1kYs/
[08:52:58] that looks ok
[08:53:18] oh, now I see stuff
[08:53:25] yeah now it is fine, there might have been a race with the exports and initially /dev/sdb was not mounted / in fstab
[08:53:50] ok proceeding with nfs workers reboot
[08:54:00] so did you remount or something?
[08:54:32] I did remount it yes
[08:54:36] ack
[08:55:28] other than the nfs workers and the bastion, how do I find out what else has nfs stuck?
[08:55:51] volans: I don't think it has any logic for that :/, it's been a while since I used it though
[08:57:06] godog: you can try modifying the 'processes in D state' there to catch all VMs in the project https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&from=now-6h&to=now&timezone=utc&var-cluster_datasource=P8433460076D33992&var-cluster=tools
[08:57:24] hmm, I think it might already be doing that
[08:57:52] good point dcaro, yes that has the whole list afaics
[08:57:58] ack thx
[09:00:40] ok next roadblock: cloudcumin1001 is failing to talk to https://prometheus.svc.beta.toolforge.org/tools/ and thus wmcs.toolforge.k8s.reboot_stuck_workers doesn't work
[09:01:06] I'm assuming some firewalling of some kind since packets seem to be dropped
[09:01:15] taavi: does that ring a bell ^
[09:01:17] ?
[09:02:22] you can also use the regular reboot_workers, passing the list, to get unblocked
[09:02:51] true, ok I'll do that for now
[09:14:12] for the prometheus access, it seems we will need to set `root@cloudcumin1001:~# export https_proxy=http://webproxy:8080`, with that it works
[09:14:59] it could just use spicerack's proxy info?
[09:15:10] mmhh am I misremembering running wmcs.toolforge.k8s.reboot_stuck_workers from cloudcumin1001 in the past with no additional settings?
[09:15:21] maybe
[09:15:43] (as in I've run it, but I'm not certain it was from cloudcumin or my local)
[09:16:07] we have both spicerack.http_proxy (plain) and spicerack.requests_proxies (preformatted for requests)
[09:16:18] what does that do?
[09:17:05] >>> spicerack.http_proxy
[09:17:05] 'http://webproxy.eqiad.wmnet:8080'
[09:17:09] >>> spicerack.requests_proxies
[09:17:09] {'http': 'http://webproxy.eqiad.wmnet:8080', 'https': 'http://webproxy.eqiad.wmnet:8080'}
[09:17:43] and you can pass it directly to requests.get(...., proxies=spicerack.requests_proxies)
[09:17:55] for example
[09:18:09] is it set in cloudcumin?
[09:18:14] where does it get the info from?
[09:18:20] yes, puppet
[09:18:28] this was spicerack-shell from cloudcumin1001
[09:18:51] https://www.irccloud.com/pastebin/6OFTHzxN/
[09:18:53] it gets pushed by puppet to /etc/spicerack/config.yaml
[09:18:56] config yep
[09:19:16] sounds like a good option yep
[09:19:38] I also remember we could call prometheus from cloudcumin without problems
[09:19:49] so I'm not sure why it's not working now
[09:27:29] the downtime of the safe_reboot cookbook has some problems: of the 3 hosts that I've done, 2 alerted in the -feed channel
[09:32:57] was it an icinga or AM alert?
[09:33:41] I think we cannot downtime in icinga so those ones will still trigger. we decided it was fine because... "we're gonna deprecate icinga" LOL
[09:33:54] why can't we downtime in icinga?
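A minimal sketch of the manual check discussed above, reusing the proxy address volans showed; the /api/v1/query path under the /tools/ prefix is an assumption about how that Prometheus is exposed:

    # from cloudcumin1001, reach the Toolforge beta Prometheus via the web proxy
    https_proxy=http://webproxy.eqiad.wmnet:8080 \
      curl -sS 'https://prometheus.svc.beta.toolforge.org/tools/api/v1/query' \
        --data-urlencode 'query=up'

Inside a cookbook the same request would simply pass proxies=spicerack.requests_proxies to requests.get(), as noted above.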
[09:34:03] cloudcumin cannot ssh to the icina host
[09:34:07] *icinga
[09:34:18] (IIRC)
[09:34:22] "Host cloudvirt1062 is DOWN" should be icinga, but also "FIRING: NeutronAgentDown: Neutron neutron-openvswitch-agent", that should be AM
[09:34:24] there's a couple yep, unable to get to nrpe
[09:34:27] ah right
[09:34:58] they have a silence also, but it expired
[09:35:27] yes, but from a quick glance at the cookbook it sets it to 30m and it took less than that, I'll dig later maybe
[09:35:42] godog: puppet is failing in the toolsbeta bastion, with the clouddumps mount issue we saw before :/
[09:35:54] ouch, thx dcaro
[09:35:57] will take a look
[09:36:06] `Started 29 minutes ago Expired 8 minutes ago`
[09:36:12] volans: could it be that after reboot the cookbook immediately deletes the silence?
[09:36:14] for the alert, from karma at least
[09:36:35] it shouldn't say expired though no?
[09:36:38] if you delete it
[09:36:39] https://alerts.wikimedia.org/?q=team%3Dwmcs
[09:36:44] I think so yes
[09:36:45] you're right, so I don't know
[09:37:05] I'll dig when I can, thanks, sorry for the noise
[09:37:15] maybe 20min is the time it uses
[09:37:17] np
[09:37:27] np, thank you for looking, I remember we had some unexpected alerts in past reboots
[09:37:35] but I'm not sure it was the same cookbook
[09:38:39] for andrew, I did a few other cloudvirts, task updated, you can continue from there
[09:38:46] * volans has to go afk
[09:39:42] godog: how is it going, should I run the full functional test suite? (wmcs.toolforge.run_tests cookbook, you can also if you want, takes ~20min currently iirc)
[09:40:34] dcaro: so puppet on bastion is fixed, nfs workers all rebooted
[09:40:38] alerts went away
[09:40:49] going through the list of non nfs workers atm and rebooting what's stuck
[09:41:06] ack
[09:41:10] things are converging though at least
[09:41:33] I took notes as I went along, will update T404584 once done
[09:41:34] T404584: [tools,nfs,infra] Address tools NFS getting stuck with processes in D state - https://phabricator.wikimedia.org/T404584
[09:42:05] we can do toolsbeta again if you had to change any cookbook things
[09:42:16] though it would not need to change networks this time
[09:43:18] yeah good point, at least we would sidestep opentofu and test the rest, I'll think about it
[09:45:32] btw. let's get rid of that second volume https://gitlab.wikimedia.org/repos/cloud/toolforge/tofu-provisioning/-/merge_requests/82
[09:46:02] is that mounted anywhere?
[09:46:35] I don't think so
[09:46:38] checking
[09:46:48] unattached
[09:46:55] burn it :)
[09:47:11] +1
[09:47:28] I'm confused by the fact I see two volumes with the same name in https://openstack-browser.toolforge.org/project/toolsbeta
[09:47:35] I didn't think it was possible
[09:47:47] it should delete by ID anyway
[09:48:02] yeah names can be duplicated, ask me how I know
[09:48:12] :D
[09:50:14] I suspect that there might be a reason why it's not deleted though (so I'm slightly expecting openstack to fail to delete it)
[09:51:09] hmm... it seems puppet is failing to run on many toolsbeta vms
[09:53:49] godog: sorry, missed your ping, but seems like you sorted it already?
[09:53:56] on which hosts dcaro?
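The removal itself goes through the tofu-provisioning MR above; purely as a sketch, the manual double-check that the stray 4gb volume really is unattached (names can repeat, so always work by ID) would be something like:

    # confirm the volume is "available" with no attachments before removing it
    openstack volume show ab8ddafd-f8f9-41e3-8a79-d985e962a2ee -c status -c attachments
    # the actual delete happened via the opentofu MR rather than `openstack volume delete <id>`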
[09:54:22] taavi: no problem, yeah we were wondering why cloudcumin1001 can't talk to svc.toolforge.org anymore
[09:54:32] even though I seem to remember it could
[09:54:46] but yes, at any rate we should be going through the prod proxy anyways I think
[09:54:52] https://prometheus.wmcloud.org/graph?g0.expr=puppet_agent_failed%7Bproject%3D%22toolsbeta%22%7D%0A&g0.tab=1&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=1&g0.store_matches=%5B%5D&g0.engine=prometheus&g0.analyze=0&g0.tenant=
[09:55:00] I'm guessing the dumps thing also
[09:55:39] (you can get to that query kinda from the alert in karma)
[09:55:42] https://usercontent.irccloud-cdn.com/file/KjKiAzvk/image.png
[09:55:59] https://usercontent.irccloud-cdn.com/file/tCHblyLa/image.png
[09:56:19] godog: yeah it seems like the reboot_stuck_workers just does a plain requests.get() against the prometheus address instead of using the prometheus wrapper in spicerack, which IIRC does the right thing with the proxies
[09:56:25] yes my bad, I'm used to showing only active alerts in karma dcaro
[09:56:56] hmm... not sure why mail needs dumps nfs
[09:57:29] mail needs user/project nfs for processing .forward files, and the toggle for nfs mounts is all-or-nothing
[09:57:30] we might not be able to mount only homes I guess in puppet
[09:57:38] yep
[09:57:46] taavi: ack, thank you, that makes sense, I'll add a note to follow up on that too
[09:59:08] dcaro: looks to me like a (new?) side effect of hard rebooting vms
[09:59:22] yep, I had not seen that nfs mount issue before
[10:00:26] also yet unclear to me why only clouddumps1001 and not 1002
[10:00:38] but anyways, I'll expire the clients manually for now then take a break
[10:00:45] 👍
[10:00:49] I'll go for lunch soon too
[10:04:06] volume deleted correctly :)
[10:04:16] dcaro: nice, thanks
[10:04:43] you got lucky too with the toolsbeta pipeline failing and not tools
[10:59:21] ty
[11:56:13] re mounts being all-or-nothing: T405462 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191005
[11:56:13] T405462: Allow customizing which mounts to enable per VM - https://phabricator.wikimedia.org/T405462
[12:03:38] interesting, it is not clear to me which problem that would solve though
[12:04:25] just not mounting the dumps nfs in vms that don't need it
[12:04:30] (like tools-mail)
[12:05:35] is that the problem? I'm not against it, I'd like to understand
[12:12:24] yep, that would be what that fixes
[12:13:28] I wouldn't call it broken, but cleanup and general problem avoidance
[12:13:38] an NFS mount cannot break if it does not exist
[12:13:52] agree
[12:20:07] fair enough, so which mounts will effectively appear is not only controlled by modules/cloudnfs/data/projects.yaml but also that variable?
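Going back to the puppet_agent_failed link above: the same expression can be queried from the command line to list the affected VMs. A sketch only; the /api/v1/query path and the instance label name are assumptions:

    curl -sG 'https://prometheus.wmcloud.org/api/v1/query' \
      --data-urlencode 'query=puppet_agent_failed{project="toolsbeta"} > 0' \
      | jq -r '.data.result[].metric.instance'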
[12:21:52] I guess I'm wary of complicating an already brittle/complicated situation (from my POV anyways) for relatively little gain
[12:22:04] if clouddumps is busted then we'll go around anyways and reboot VMs
[12:22:13] s/if/when/
[12:25:46] it seems though that every time you reboot a vm you have to unbust it from clouddumps manually as it is
[12:26:08] agree though that it's not a big gain
[12:27:30] yes, kicking clients from clouddumps1001 after hard reboot is still an open issue
[12:27:43] from clouddumps1001 only so far, not 1002, anyways
[12:27:54] yep, kinda weird
[12:28:17] I see obvious disadvantages vs dubious advantages with the patch, my two cents
[12:33:56] hmm, you have a very good point
[13:18:24] when pcc says "Hosts that were skipped (fail fast)" that usually means it doesn't have facts for the host?
[13:18:36] seeing that more and more lately, maybe because I'm reimaging lots of things
[13:18:59] do you have an example?
[13:19:07] sure, https://puppet-compiler.wmflabs.org/output/1191018/7536/
[13:19:20] most of the time it's a typo in the hostname but today I think it isn't. Hubris!
[13:19:53] "cloudcephosd1035.eqiad.wmnet cloudcephmon1004.eqiad.wmnet" is not an individual host
[13:20:04] your Hosts: commit trailer is missing a comma between those two
[13:20:32] heh, a different sort of typo
[13:20:35] thx
[13:25:30] works much better with the ,
[13:30:13] does it care about spaces? (I hope not)
[13:35:15] hmm... I'm finding some tools with weird memory limit/request values
[13:35:21] https://www.irccloud.com/pastebin/hC0lM7B3/
[13:35:31] only a few though
[13:35:58] is that something a user is setting with a decimal point?
[13:37:04] I suspect it might be from very old jobs
[13:38:15] tool-os-deprecation has that issue, but the crons are from august :/
[13:38:53] https://www.irccloud.com/pastebin/tiMU8Xkd/
[13:39:03] somehow the 2G got rounded up weird somewhere
[13:39:19] that reminded me to look at https://os-deprecation.toolforge.org/ and shudder at the deployment-prep list :(
[13:40:02] re-uploading the cron got it ok this time though
[13:40:11] might have been a jobs-api thing at some point
[13:49:54] seeing these failures in production puppet CI:
[13:49:56] 13:46:33 rspec './modules/profile/spec/classes/profile_cloudceph_client_rbd_backy_spec.rb[1:1:1:2]' # profile::cloudceph::client::rbd_backy on debian-11-x86_64 when no ceph repo passed uses correct default is expected to contain Apt::Package_from_component[ceph] with component => "thirdparty/ceph-quincy"
[13:50:58] andrewbogott: ^
[13:51:38] wow "when no ceph repo passed uses correct default is expected to contain " is hard to parse
[13:51:43] but I think I can fix it
[13:52:16] very possible the assumptions of the spec tests need to be adjusted
[13:53:25] I think the test name may be missing a verb
[13:53:58] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191027
[13:54:12] I wonder why my patch that introduced the problem passed?
[13:55:33] I think it's just missing a `the`, maybe `the correct default repo`
[13:56:31] cdanis: patch is merged, lmk if that doesn't fix things
[13:59:25] jenkins is happy now, thanks :)
[14:02:56] topranks: have time for the sync meeting?
[14:03:31] sorry yes, just lost track, joining now
[14:51:46] godog: I'd like to pool cloudcephosd1052 because it has slightly irregular hardware and I want to make sure it works. Is it ok if I reserve 1050 and 1051 for your work and pool 1052?
[14:51:58] andrewbogott: sure SGTM
[16:20:49] What's with the NFS alert?
[16:21:02] godog, that you?
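A sketch of how the odd memory limit values above could be listed for one tool's cronjobs (namespace name taken from the log; the jsonpath assumes the limits sit on the job's first container):

    kubectl -n tool-os-deprecation get cronjobs \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.jobTemplate.spec.template.spec.containers[0].resources.limits.memory}{"\n"}{end}'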
[16:21:24] looks like a blip, the url works for me now
[16:21:27] ok
[16:21:32] and there's the recovery
[16:21:42] I'm pooling an OSD node...
[16:21:49] hmm, interesting
[16:21:55] but only a little bit, and it's been ongoing for >1 hour
[16:22:02] so I don't know why it would complain now
[16:22:12] then maybe not so much xd, might be the dell disks moving really old data?
[16:22:27] might also just be a timeout somewhere in between
[16:22:38] toolschecker is not within toolforge though right?
[16:22:39] lol, this dashboard is funny https://grafana.wikimedia.org/d/P1tFnn3Mk/wmcs-ceph-eqiad-health?orgId=1&from=now-6h&to=now&timezone=utc
[16:22:43] that one might be
[16:23:01] toolschecker is a separate vm in the tools project
[16:24:11] the host I'm pooling shows a disk saturation spike 5 minutes ago https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&from=now-3h&to=now&timezone=utc&var-server=cloudcephosd1052&var-datasource=000000026&var-cluster=wmcs&refresh=5m
[16:24:12] iirc there was some part of it running as a cronjob or something?
[16:25:34] andrewbogott: maybe
[16:25:47] everything else looks quite ok though
[16:25:55] yeah.
[16:26:03] there were some lost pings around the network not too long ago, probably a switch getting a bit busy?
[16:28:35] nah, it's the reboot of cloudcephmon1005
[16:30:51] that certainly shouldn't have caused NFS issues
[16:31:07] I think? There was always a quorum
[16:32:16] agree
[16:35:08] looks like it might take ~24 hours for things to rebalance after adding this OSD, so it's safe to assume that rebalancing is ongoing during your day tomorrow if you see other blips.
[16:38:54] ack
[17:35:47] * dcaro off
[17:35:49] cya tomorrow
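For the rebalancing mentioned at the end, a couple of read-only commands (a sketch, assuming they are run on a ceph admin/mon node) are enough to watch the backfill triggered by pooling cloudcephosd1052:

    sudo ceph -s            # overall health plus recovery/backfill progress
    sudo ceph osd df tree   # per-OSD utilisation as data moves onto the new node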