[08:15:45] I'd like to set "prometheus::default_web_instance: 'tools'" in hiera for toolforge, what's the recommended place?
[10:25:34] godog: probably https://horizon.wikimedia.org/project/puppet/
[10:43:16] godog: iirc we already did set that, but the switch is/was broken?
[10:43:39] since I remember breaking prod prometheus's lvs health checks when trying to fix it
[11:00:06] dhinus: ack thank you
[11:00:38] taavi: ah mhhh could be, I don't see the redirect at https://prometheus.svc.toolforge.org hence my question, good point re: health checks tho
[11:03:09] * godog lunch
[11:37:38] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146973 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151612, it seems
[12:01:12] taavi: ok thank you, do you remember offhand why/how it works in production?
[12:02:39] godog: basically, the LVS checks expect a 200 from / but that patch "fixed" it to be a redirect instead
[12:03:55] ack
[13:53:32] godog: shall I work on a zookeeper setup or are you already halfway through that?
[14:28:18] andrewbogott: heh I haven't thought about the implementation in terms of priority tbh
[14:29:03] I'll probably have a look. If it's the last thing that makes our cloudcontrols not proper failovers it seems important-ish
[14:29:28] Looks like dzahn has already set up some nicely-documented zookeeper puppet classes.
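The LVS health-check breakage taavi describes above can be simulated with a minimal sketch (Python stdlib only, no Wikimedia infrastructure or real probe code assumed): a probe that demands an exact 200 from `/` marks the backend down as soon as `/` starts answering with a redirect.

```python
# Hypothetical demo: a vhost whose "/" redirects (like the patched
# Prometheus frontend) versus a naive probe that expects exactly 200.
import http.client
import http.server
import threading


class RedirectingRoot(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # After the "fix": "/" answers with a redirect instead of 200.
        self.send_response(302)
        self.send_header("Location", "/tools/")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


server = http.server.HTTPServer(("127.0.0.1", 0), RedirectingRoot)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# http.client does not follow redirects, just like a strict health probe.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
conn.request("GET", "/")
status = conn.getresponse().status
print("probe saw", status, "->", "healthy" if status == 200 else "DOWN")
server.shutdown()
```

A probe configured to follow redirects (or to check a path that still returns 200) would avoid the problem, which is presumably why the switch is tricky to make safely for production.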
[14:29:50] ok SGTM
[14:30:44] Need +1 for testplatform vps project creation https://phabricator.wikimedia.org/T423226
[14:31:35] Raymond_Ndibe: done
[14:33:36] in terms of failover as in "host drops off the network entirely" FWIW we also have T422820 for rabbit/oslo, then that's it as far as I'm aware
[14:33:37] T422820: oslo.messaging does not failover to the next rabbit host on traffic blackhole situations - https://phabricator.wikimedia.org/T422820
[14:34:21] yeah, that one seems messier :(
[14:36:25] andrewbogott: I was thinking we could at least get upstream's opinion/ideas, just in case we're holding it wrong, surely we can't be the only ones having this problem
[14:36:57] yeah
[14:37:42] I guess maybe the normal openstack mailing list, or possibly in #openstack-oslo on OFTC although that's more for devs than operators
[14:39:37] sounds good, I'll try irc tomorrow and the ML if that fails
[14:39:59] andrewbogott: Raymond_Ndibe: the project name of T423226 seems awfully close to the "do not do this" part of https://phabricator.wikimedia.org/project/view/2875/
[14:39:59] T423226: Request creation of testplatform VPS project - https://phabricator.wikimedia.org/T423226
[14:42:04] taavi: you mean because it sounds team-based rather than project-based?
[14:42:39] yeah
[14:43:49] testplatformtests also seems bad :)
[14:44:30] maybe that's a sign that the project scope should be a bit more generic than 'any tests in this general area' :)
[14:49:29] * andrewbogott suggests 'ciperformance' on task
[14:49:33] also not perfect
[14:50:40] ciplatformtests?
[14:54:15] that's not bad
[15:18:45] Are these recurring maintain-dbusers emails something that's already understood/handled?
[16:24:44] andrewbogott: I'll have a look
[16:24:53] thx
[16:25:20] isn't your day pretty much over though?
I can also look, just didn't want to duplicate effort
[16:25:49] my plan was exactly to look at this before ending my day :D
[16:26:03] (I noticed the alert before but I had a meeting)
[16:26:27] ok! lmk if there's anything to hand off
[16:26:39] I will
[16:35:37] the error is: "Can't connect to MySQL server on 'an-redacteddb1001.eqiad.wmnet'"
[16:35:45] that host is currently being reimaged
[16:38:54] btullis: do you expect an-redacteddb1001 will be back online soon? ^
[16:39:25] otherwise we can remove it from maintain-dbusers
[16:39:31] Oh what! It should be fully up and running, all services started.
[16:40:04] All services are green in icinga. https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=an-redacteddb1001
[16:40:33] Ah, it is probably because the IP got changed. Do we have to update the zarcillo database manually, or something?
[16:41:08] maybe, let me check
[16:42:37] It's fine in orchestrator.
[16:47:39] looks like it's a firewall problem
[16:47:46] I cannot connect on the mysql port from cloudcontrol1007
[16:47:49] but I can connect to clouddbs
[16:48:01] I can ping it though
[16:50:09] maybe a repeat of T368316
[16:50:09] T368316: maintain-dbusers.service failing on cloudcontrol1005 - https://phabricator.wikimedia.org/T368316
[16:55:07] hmm, in that case a rule was missing in homer, but the rule was added and should still be valid
[16:56:30] Yes, I found this: https://gerrit.wikimedia.org/g/operations/homer/public/+/691bd9251cabb7554600ef7f055c7553f3d28494/policies/cr-cloud-hosts.yaml#121
[16:57:09] Maybe we need to run homer against the cloud switches again, if it needs to be updated with fresh netbox data, or something.
[16:57:10] yep exactly, but it doesn't seem to be working
[16:57:38] topranks: maybe you know?
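The "I can ping it but can't connect on the mysql port" diagnosis above is worth generalizing: ping only proves ICMP reachability, while a firewall rule gates the TCP port a client actually uses. A small stdlib sketch of that port-level check (the host and port in the usage comment are just the ones from this incident, used illustratively):

```python
# Hedged sketch: test the TCP port itself rather than ICMP reachability,
# since a firewall can drop port 3306 while still answering pings.
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Unlike ping, this exercises the same path a MySQL client does,
    so a rule dropping the port shows up here as a failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # e.g. from cloudcontrol1007 in the incident above:
    # tcp_reachable("an-redacteddb1001.eqiad.wmnet", 3306)
    print(tcp_reachable("127.0.0.1", 3306))
```

`nc -zvw2 <host> 3306` from the affected client does the same thing from a shell.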
^
[17:00:48] * taavi looking
[17:03:16] yeah, the latest run of the GetHosts netbox script is a week old
[17:04:34] (T361549 is the task for automating noticing this)
[17:04:34] T361549: Automatically run Capirca Netbox script regularly - https://phabricator.wikimedia.org/T361549
[17:07:28] seems like taavi has the answer yep, run the script then re-run homer for the cloudsw and crs
[17:07:39] let me know if you need a hand
[17:10:45] thanks both! taavi can you please run the script, or link me to some docs? I don't know how to run the script
[17:11:11] dhinus: it was run a few mins ago by taavi
[17:11:34] This is the url, running is just a matter of clicking the button: https://netbox.wikimedia.org/extras/scripts/1/
[17:11:36] dealing with it, and as usual merging a bunch of other new/decom'd hosts at the same time
[17:11:59] topranks: you need to check that checkbox first!
[17:12:04] very difficult, I know
[17:12:14] (I failed it the first time)
[17:12:26] ha yeah
[17:13:33] I'll discuss in netops, to me it seems a little mad we don't have it running automatically every night or something
[17:16:39] dhinus: btullis: try again now?
[17:17:33] seems to be working
[17:18:25] yep, I see some "Created account in an-redacteddb1001.eqiad.wmnet:" in the maintain-dbusers logs
[17:24:06] the alert should clear soon
[17:24:09] thanks everyone!
[17:24:46] andrewbogott: should be all sorted
[17:24:54] great!
[17:29:27] * dhinus off
[17:46:36] Nice. Thanks, all.
[18:32:10] bd808: I assume you also got the github email about https://github.com/toolforge/toolforge-php-mysql webhook token needing to be rotated?
[19:34:01] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1271029 will not work, the .sources format is different from the .list format
[19:40:04] dang, ok. I guess it was only warnings when both files were present
[20:21:37] taavi: yes. I haven't made time to do anything about the email yet, but I have it.
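For context on taavi's `.sources` vs `.list` point above: apt's classic one-line `.list` entries and the newer deb822-style `.sources` stanzas are different file formats, so the contents can't be copied from one to the other verbatim. A side-by-side sketch with an illustrative repository (not the actual one from the patch):

```text
# /etc/apt/sources.list.d/example.list  (classic one-line format)
deb http://apt.example.org/debian bookworm main

# /etc/apt/sources.list.d/example.sources  (deb822 format, same repo)
Types: deb
URIs: http://apt.example.org/debian
Suites: bookworm
Components: main
```

When both a `.list` and a `.sources` file describe the same repository, apt warns about the duplicate entry, which matches andrewbogott's observation that there were only warnings while both files were present.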
[20:37:07] stashbot, what's your deal
[20:37:07] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[20:37:19] wmopbot, do you also have a help page?
[20:37:27] seems not
[20:47:33] andrewbogott: https://wmopbot.toolforge.org/, but you may not have the IRC creds to get in
[20:50:28] actually you have the founder bit here so I think wmopbot should answer if you knock
[21:04:20] thanks bd808 -- I was trying to answer bliviero's general question "what is with all these bots?" and I asserted that they were self-documenting which turns out to be only partly true :)
[23:30:56] I filed T423354 to get the object count quota bumped up for etherpads3
[23:30:57] T423354: Increase object storage object count for etherpads3 project - https://phabricator.wikimedia.org/T423354