[00:19:42] /srv/ops/harbor is full on toolsbeta-harbor-1. I'm not clear on whether that harbor instance is doing anything these days, but it's paging me, so I'm copying this giant backup.tar.gz file down to my laptop so I can delete it off the full volume.
[09:29:43] dhinus: quick review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153565? apparently the current config structure and what the current script expects are not the same :/
[09:40:51] taavi: looking
[09:43:14] after that I think we can try moving traffic to x3 once again, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153564/
[09:48:46] hmm did that _ever_ work? it was introduced here https://gerrit.wikimedia.org/r/c/operations/puppet/+/657890
[09:49:10] dhinus: my best guess is that it never did
[09:49:37] similar to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153562, which has also been broken for ages probably
[09:52:06] * dhinus really wants to add more tests :P
[09:52:16] (and/or types)
[09:52:33] I +1d the first one
[09:52:50] yeah, me too but E_NOTENOUGHTIME
[09:52:59] same :P
[09:53:10] let's get this working first
[09:54:16] parameter 'section_ports' key of entry 'm7' expects a match for Profile::Mariadb::Valid_section = Enum['analytics_meta', 'backup1-codfw', 'backup1-eqiad', 'es1', 'es2', 'es3', 'es4', 'es5', 'es6', 'es7', 'm1', 'm2', 'm3', 'm5', 'matomo', 'ms1', 'ms2', 'ms3', 'pc1', 'pc2', 'pc3', 'pc4', 'pc5', 'pc6', 'pc7', 'pc8', 's1', 's2', 's3', 's4', 's5',
[09:54:16] 's6', 's7', 's8', 'staging', 'tendril', 'test-s1', 'test-s4', 'x1', 'x3', 'zarcillo'], got 'm7'
[09:54:24] sigh, should've run a PCC first
[09:54:48] well at least it's breaking there and not silently later :)
[09:55:13] what is m7 btw?
[09:55:16] no clue
[09:55:57] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153571
[10:00:25] +1d, I will open a task to track down why m7 exists in hiera, but is not "valid"
[10:10:07] T395999
[10:10:08] T395999: m7 section exists in hiera, but not in valid_sections.pp - https://phabricator.wikimedia.org/T395999
[10:19:28] dhinus: merged, and now it created all the accounts successfully?
[10:19:29] Jun 04 10:19:01 cloudcontrol1007 maintain-dbusers[110196]: INFO [root._create_accounts_on_host:1005] Created account in clouddb1016.eqiad.wmnet:3363 for tool tools.catscan2
[10:19:29] Jun 04 10:19:01 cloudcontrol1007 maintain-dbusers[110196]: INFO [root._create_accounts_on_host:1005] Created account in clouddb1016.eqiad.wmnet:3363 for tool tools.quarry
[10:19:29] Jun 04 10:19:02 cloudcontrol1007 maintain-dbusers[110196]: INFO [root._create_accounts_on_host:1005] Created account in clouddb1016.eqiad.wmnet:3363 for user wikiscan
[10:19:50] although I'm a bit sceptical since it claims to have created those on 1020 just fine, but I see no way that could've worked then
[10:23:23] which port on 1020?
[10:23:30] maybe in the old sections?
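The `got 'm7'` failure pasted above is Puppet's type system rejecting a hiera key that isn't listed in the Valid_section enum, i.e. exactly the drift tracked in T395999. A minimal sketch of the kind of check dhinus is wishing for: compare the hiera section keys against the enum members before compilation fails. The file paths, hiera key name, and parsing below are assumptions, not the actual repo layout or an existing test.

```python
"""Sketch: flag hiera sections (like 'm7') that are missing from the
Valid_section enum. Paths and key names are assumptions."""
import re

import yaml

VALID_SECTIONS_PP = "modules/profile/types/mariadb/valid_sections.pp"  # assumed path
HIERA_FILE = "hieradata/common/profile/mariadb.yaml"                   # assumed path


def enum_members(path: str) -> set[str]:
    # Pull every quoted member out of the Enum[...] declaration.
    with open(path) as f:
        return set(re.findall(r"'([^']+)'", f.read()))


def hiera_section_keys(path: str) -> set[str]:
    with open(path) as f:
        data = yaml.safe_load(f)
    # Assumed name for the mapping that tripped the "got 'm7'" error.
    return set(data.get("profile::mariadb::section_ports", {}))


if __name__ == "__main__":
    unknown = hiera_section_keys(HIERA_FILE) - enum_members(VALID_SECTIONS_PP)
    if unknown:
        raise SystemExit(f"in hiera but not in valid_sections.pp: {sorted(unknown)}")
    print("all hiera sections are valid")
```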
[10:26:27] sorry, I mean x3 on 1020, the script thinks all the accounts there were created fine but I'm not sure I see how that could've happened with that config format bug
[10:28:53] the config format patch is merged though, or is there another one?
[10:29:59] the script thinks those accounts on 1020 were created before that patch was merged
[10:30:50] I see
[10:35:00] maybe it was creating the accounts but ignoring the custom max_connections?
[10:37:04] dhinus: uh, no, I really don't understand how but it has created the users with the correct values
[10:37:09] want to try moving traffic?
[10:37:17] sure
[10:37:47] alright, I'm merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153564
[10:42:53] looking for MAX_USER_CONNECTIONS I found they're duplicated in modules/role/templates/mariadb/grants/wiki-replicas.sql
[10:43:29] I'm not sure who/what uses that file, maybe the grants were copy/pasted manually?
[10:43:41] what a mess
[10:44:08] I'm surprised anything is even working at this point :D
[10:46:49] I would probably drop the MAX_USER_CONNECTIONS from modules/role/templates/mariadb/grants/wiki-replicas.sql, then manually delete one of the grants, and check that maintain-dbusers recreates it correctly
[10:52:05] except that the sql file grants quarry 48 connections but that's not present in the hiera yaml
[10:52:32] no, quarry is there, the analytics user (with 200) is not
[10:53:12] yep, I'm creating a patch that moves the analytics one to the yaml file, and removes everything from the sql
[10:53:17] perfect, thanks
[10:53:30] I'll update the wikitech news and send an email announcement?
[10:54:52] or actually I'll deploy the updated `sql` first and send an announcement after that
[10:55:05] sounds good
[10:56:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153579
[10:57:28] * dhinus lunch
[11:08:51] folks fyi due to netbox changes an updated acl is being pushed by automation to our core routers in codfw
[11:09:11] seems to be replacing cloudcontrol2009-dev IPs with cloudcontrol2010-dev IPs, does that make sense?
[11:30:50] topranks: that surprises me but https://phabricator.wikimedia.org/T393102 is in progress so likely related?
[11:31:20] Also there is no cloudcontrol2009-dev anymore (it was renamed), so it seems harmless at worst.
[11:32:00] yeah, those ACLs are based on the set of active servers in netbox with a given named prefix
[11:32:24] the script that updates them doesn't run that often, so we're probably a little behind here, but yes it all seems to match that work
[11:32:41] andrewbogott: sorry I was afk the other day, you were asking me about a cloudceph node?
[11:33:45] hm, yeah, I have a backlog now of three or four network questions... if you have time I will try to put them in order.
[11:33:54] Do you want the easy ones first or the hard one?
[11:35:10] let's warm up with the easy ones :D
[11:35:21] I think the simplest is https://phabricator.wikimedia.org/T393614#10880601, just moving a host to a different port
[11:36:13] which host specifically needs to move?
[11:37:11] huh, he doesn't say, does he? Can netbox tell us what's plugged into 24?
[11:38:38] none of the hosts appear to be in netbox already so it's a moot point
[11:38:52] oops forgot which dc, s/he/she/
[11:39:28] I suspect that it's moving an existing server out of the way... but I think that will have to wait until Texans wake up. I'll ask on the task
[11:39:56] I just replied there yeah
[11:40:06] anyway yeah that one ought to be easy to fix up
[11:40:11] ok, thanks.
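For context on the MAX_USER_CONNECTIONS cleanup discussed above (10:42–10:56): MariaDB applies a per-account connection limit via a `WITH MAX_USER_CONNECTIONS n` clause on the grant, and the patch moves the hard-coded values out of wiki-replicas.sql so they come from hiera instead. The sketch below shows the general idea only; the yaml structure, the placeholder account names, and the grant text are assumptions, not the real maintain-dbusers code or the real template.

```python
"""Rough illustration: per-user connection limits come from a yaml mapping
and are rendered into the grant. Names and structure are hypothetical."""
import yaml

HIERA_SNIPPET = """
# hypothetical structure for the per-user overrides being moved into yaml
max_user_connections:
  s00001: 48    # e.g. the quarry account (48 is the value quoted in the chat)
  s00002: 200   # e.g. the analytics account (200 is the value quoted in the chat)
"""
DEFAULT_MAX_CONNECTIONS = 10  # assumed default for everyone else


def grant_statement(user: str, host_pattern: str, overrides: dict[str, int]) -> str:
    limit = overrides.get(user, DEFAULT_MAX_CONNECTIONS)
    # MariaDB enforces the per-account limit via this clause on the grant.
    return (
        f"GRANT SELECT, SHOW VIEW ON `%\\_p`.* TO '{user}'@'{host_pattern}' "
        f"WITH MAX_USER_CONNECTIONS {limit};"
    )


overrides = yaml.safe_load(HIERA_SNIPPET)["max_user_connections"]
print(grant_statement("s00002", "%", overrides))  # uses the 200 override
print(grant_statement("s99999", "%", overrides))  # falls back to the default
```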
done :)
[11:40:20] I think it's the $1,000 question next
[11:41:02] Next is https://phabricator.wikimedia.org/T394333#10881532 which is maybe just a DC issue as well, could be it's just plugged into the wrong place.
[11:42:08] but when I try to set up that host the cookbook tells me that it's failing ping tests
[11:42:11] which seems important!
[11:46:35] that is the first in a line of new & weird cloudceph OSDs with BOSS cards. We're hoping to have them connected to 25G ports because they handle so much storage.
[11:49:51] I commented back on the task there
[11:51:11] hm, I thought that new ceph nodes had only a single port connected...
[11:51:15] * andrewbogott looks for an example
[11:52:20] nope, I'm wrong
[11:52:34] so I will follow up on that one, guess that's my mistake or david's
[11:53:17] I've just run sre.dns.netbox for the krb1001 decom and it added octavia-lb-mgmz-net, that sounds like some OpenStack related thing?
[11:53:29] octavia-lb-mgmt-net
[11:53:54] moritzm: yeah, that's me, although I'm sure I ran the script already after I made that change...
[11:54:06] well anyway, it's fine for you to allow that. Sorry for the confusion.
[11:54:51] ok, thanks for confirming, I'll push it along
[11:55:42] topranks: ok, next... trying to set up cloudcontrol2010-dev and it fails making its dhcp request on startup. I'm pretty sure it's not reaching install2004 at all but not 100%.
[11:59:16] that host is in a reboot/fail loop now so it's probably trying dhcp over and over, but I can restart the reimage whenever you need.
[12:01:33] andrewbogott: yeah we'd love to get the ceph hosts to one link, really there is no need for two
[12:01:35] tbd again
[12:01:44] is there a task for cloudcontrol2010?
[12:01:58] yes, T393102
[12:01:59] T393102: Q4:rack/setup/install cloudcontrol2010-dev - https://phabricator.wikimedia.org/T393102
[12:12:45] andrewbogott: what is the current status of cloudcontrol2010, is there a reimage running?
[12:12:56] if not I will kick one off and see if I can see what's wrong
[12:13:05] do you have the cookbook command you were running handy?
[12:14:28] sudo cookbook sre.hosts.reimage --os bullseye --new cloudcontrol2010-dev
[12:14:37] no cookbook currently running
[12:14:57] I'm attached to the console though, want me to detach so you can watch?
[12:15:46] nah actually I spotted the problem
[12:16:03] dc-ops added it on the cloud-private vlan rather than the cloud-hosts one that everything should be using as primary
[12:17:18] ok, cool, thank you!
[12:17:46] Want me to start the reimage on my end then?
[12:18:07] no, I need to fix up all the IPs and DNS first, give me a few minutes and I'll let you know
[12:18:49] ok!
[12:28:40] andrewbogott: ok you can try that reimage again now, with any luck it'll work ok
[12:28:44] was there another one?
[12:29:29] yep, one more, related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1152878/1/hieradata/role/eqiad/wmcs/cloudgw.yaml
[12:30:13] this one I understand even less than the others. I'm trying to reproduce the management network that Arturo built in codfw1dev, which works.
[12:30:35] The ultimate goal is for a service to be able to ssh to VMs on that mgmt network from cloudcontrols.
[12:31:09] So for instance, I have an 'amphora' VM in codfw1dev that's reachable via root@cloudcontrol2004-dev:~# telnet 172.16.131.107 22
[12:31:26] but a similar example in eqiad1 times out
[12:31:27] root@cloudcontrol1007:~# telnet 172.16.24.37 22
[12:31:50] Everything /looks/ the same to me but I haven't been able to find where the issue is.
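The telnet probes above boil down to a TCP connect to port 22 on each amphora's management address, so a small script makes it easy to rerun the codfw1dev-vs-eqiad1 comparison after every change. This is only a sketch of the manual test: the addresses are the ones quoted in the chat and the timeout is arbitrary.

```python
"""Sketch: repeat the manual telnet reachability check for both amphorae."""
import socket

TARGETS = {
    "codfw1dev amphora (works)": "172.16.131.107",
    "eqiad1 amphora (times out)": "172.16.24.37",
}

for label, addr in TARGETS.items():
    try:
        # Same thing telnet was being used for: can we open TCP 22 at all?
        with socket.create_connection((addr, 22), timeout=5):
            print(f"{label}: port 22 reachable")
    except OSError as exc:
        print(f"{label}: port 22 unreachable ({exc})")
```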
Could be the subnet is totally wrong, or could just be a firewall someplace.
[12:31:55] How's that for open-ended?
[12:31:58] remember to reboot cloudgw servers after applying routing changes
[12:32:18] because puppet won't apply /etc/network/interfaces changes, it only deploys the files to the fs
[12:32:57] so in order for the routing to be enabled, you need to either restart networking on the cloudgw servers, or reboot (which is safer)
[12:32:58] arturo: ok... can I reboot those any time as long as I only do one at a time?
[12:33:02] yes
[12:33:18] IRC bots should get killed as alwasy
[12:33:28] always*
[12:33:46] topranks: I'll start with that
[12:34:19] there's the roll_reboot_cloudgws cookbook which'll deal with the order and such by itself
[12:36:41] oh, oops, I'm already doing it the wrong way
[12:37:05] * andrewbogott should always check the cookbook list first
[12:37:48] lol
[12:38:44] topranks: that pxe boot is working now
[12:38:55] woohoo!
[12:43:21] * andrewbogott rebooting the second cloudgw
[12:47:05] those reboots don't seem to have resolved things, my telnet test still hangs
[12:48:29] not sure if this is the direct cause, but 172.16.24.0/24 is missing from modules/network/data/data.yaml
[12:53:06] thank you taavi! I just made https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153614
[12:53:37] I think those names need to be unique, so they need some sort of -eqiad/-codfw suffixes or something
[12:55:12] ok -- they're otherwise arbitrary though, right?
[12:56:58] patch updated
[12:59:23] I'll do a PCC
[13:02:36] https://puppet-compiler.wmflabs.org/output/1153614/5761/ is indeed adding the missing route to cloudcontrols
[13:12:57] well that seems important. I guess that also means rebooting cloudgws again
[13:13:37] nope
[13:13:55] oh right, because it's not cloudgw routing
[13:14:08] the route on the cloudgws is handled by cloudgw-specific hiera (ugh, yes) which is already done, and the route on other hosts on cloud-private is handled by that
[13:14:16] so this is just a 'wait for puppet to do its thing' situation
[13:17:24] * andrewbogott considers opening the windows, notes that Manitoba is still on fire
[13:22:59] * andrewbogott waits for puppet to do its thing, steps away for a bit
[13:39:42] taavi: I merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153563, then manually deleted the user s53272 on clouddb1020:x3
[13:39:48] maintain-dbusers is NOT recreating it :/
[13:40:01] I tried restarting the unit as well
[13:40:18] dhinus: yes, maintain-dbusers has its own separate database somewhere, and as long as that database says the user exists then it won't recreate it
[13:40:23] taavi: that was the last piece, the new loadbalancer now shows ACTIVE/ONLINE
[13:40:24] ah yes
[13:40:45] thank you topranks & taavi! I think that's all of my network confusion resolved for the next 20 minutes or so.
[13:43:58] Can someone catch me up about toolsbeta-harbor-1 vs toolsbeta-harbor-2? -1 is going to fill up its drive soon, I can resize or I can just delete it if it's obsolete.
[14:57:42] taavi: "maintain-dbusers delete" worked fine, but it also deleted the user on an-redacteddb1001, which I didn't think about
[14:57:55] I pinged btullis on Slack, I think it makes sense to have all dbs in sync anyway
[14:58:10] it should be fine, it'll get recreated in a moment anyway
[14:58:15] yes it's already back
[14:58:19] but with a new password, I imagine?
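The behaviour taavi describes above follows from a reconcile loop that trusts its own state database: dropping a user by hand on clouddb1020 is invisible to it, while `maintain-dbusers delete` clears the state for every host, so the account comes back everywhere with a freshly generated password. The sketch below is only an illustration of that logic; the table and column names are invented and do not reflect the real script.

```python
"""Sketch of a maintain-dbusers-style state check. Schema is hypothetical."""
import sqlite3  # stand-in for whatever the real state store is


def accounts_to_create(state_db: sqlite3.Connection, host: str) -> list[str]:
    # Only rows the state DB does not consider 'created' get a new grant;
    # a manual DROP USER on the replica never shows up in this query.
    rows = state_db.execute(
        "SELECT username FROM account_host WHERE hostname = ? AND status != 'created'",
        (host,),
    )
    return [row[0] for row in rows]


def forget_account(state_db: sqlite3.Connection, username: str) -> None:
    # Roughly what `maintain-dbusers delete` achieves: drop the state rows for
    # every host, so the next run recreates the account everywhere -- which is
    # also why it comes back with a new password.
    state_db.execute("DELETE FROM account_host WHERE username = ?", (username,))
    state_db.commit()
```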
[14:58:37] oh right, yes
[14:58:39] whoops
[14:59:25] I should've used one of the other 3 users to do this test, but I didn't think about it :)
[14:59:45] the good news is that maintain-dbusers did recreate it with the right value for max_connections
[19:23:01] andrewbogott: fyi `project/paws/userhomes/78235578/a/` seems to be using ~117GB of space, it looks like an android sdk and tooling and such, might be interesting to contact the user to clean it up
[19:24:24] there's also a >10GB file:
[19:24:24] > 11G project/paws/userhomes/49538503/wikidown/july-graffiti/all-images.zip
[19:24:32] so I think the cleanup cron is not working
[19:24:41] (no idea how it's set up)
[19:24:56] anyhow, /me off
[19:25:03] thanks for looking into it
[19:25:04] @
[19:25:05] !
[21:00:39] New quota request https://phabricator.wikimedia.org/T396073. Anyone around to +1?
[22:52:33] Raymond_Ndibe: +1 given, if you are still around
[23:05:00] Ok, thanks Bryan. Will handle it.
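A rough sketch of the kind of scan behind the PAWS findings above: walk the userhomes tree and flag any home directory over a size threshold, since the cleanup job doesn't seem to be catching things like the ~117GB Android SDK. The path prefix and the threshold are assumptions, and this is not the existing cleanup cron.

```python
"""Sketch: report oversized PAWS user homes. Path and threshold are assumed."""
import os

USERHOMES = "project/paws/userhomes"  # assumed prefix, as quoted in the chat
THRESHOLD_BYTES = 10 * 1024**3        # flag anything over 10 GiB


def dir_size(path: str) -> int:
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda _e: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                continue
    return total


for home in sorted(os.listdir(USERHOMES)):
    path = os.path.join(USERHOMES, home)
    if os.path.isdir(path):
        size = dir_size(path)
        if size > THRESHOLD_BYTES:
            print(f"{home}: {size / 1024**3:.1f} GiB")
```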