[10:59:51] <_joe_> heads-up: some mw servers in eqiad still have puppet disabled. I will reenable it after lunch when I am confident I can conclude the switch to envoy
[12:06:23] hello! I could use a refresh of the Doxygen debian package for Buster. I have crafted a patch that updates our fork from upstream https://gerrit.wikimedia.org/r/#/c/operations/debs/doxygen/+/589416/
[12:06:47] that is for CI containers and the result should go under buster-wikimedia component/ci
[12:07:06] would anyone please be able to assist? ;]
[12:09:30] <_joe_> hashar: I will take a look later
[12:09:40] cool thanks :]
[12:10:16] <_joe_> I guess you're just backporting from sid?
[12:13:32] yeah pretty much
[12:14:21] that change is a merge commit with some fix incorporated. Gerrit doesn't necessarily show them nicely
[12:14:32] <_joe_> what fix?
[12:14:35] but https://gerrit.wikimedia.org/r/#/c/operations/debs/doxygen/+/589416/-2..6 shows the diff of all changes in the repo compared to upstream
[12:14:46] <_joe_> uhm ok
[12:14:53] <_joe_> not ideal but at least I can review it
[12:15:04] notably relax qtbase5-dev version and use llvm7 instead of 9
[12:15:51] <_joe_> which is the backport, yes
[12:16:14] `Error checking bridges on destination node 'ganeti1001.eqiad.wmnet': Error 60: server certificate verification failed. CAfile: /var/lib/ganeti/server.pem CRLfile: none`
[12:16:23] is there a known issue with the eqiad ganeti cluster?
[12:17:23] <_joe_> no
[12:17:38] kormat: what are you trying to do?
[12:17:39] <_joe_> but possibly the same problem we had on codfw
[12:17:47] <_joe_> so the certs are expired
[12:17:51] volans: start an instance
[12:18:02] 1001 is not the master
[12:18:03] <_joe_> mutante is off right now, he fixed codfw the other week
[12:18:04] _joe_: ah ok
[12:18:12] 1003 is the current master
[12:18:15] in that cluster
[12:18:22] volans: yes? i'm running this _from_ 1003.
[12:18:32] it's complaining about 1001, which is where the instance lives
[12:18:49] ah ok, I read it as you were running it on 1001
[12:19:00] looks like https://wikitech.wikimedia.org/wiki/Ganeti#Expired_cluster_certificates documents this case
[12:19:02] so yes most likely cert issue
[12:19:40] YAMCA :/
[12:20:03] <_joe_> akosiaris: around?
[12:20:03] (yet another missing certification authority)
[12:20:12] <_joe_> volans: *grrr*
[12:23:46] <_joe_> ok, shouldn't we just commit this file to puppet and distribute it?
[12:24:30] <_joe_> it seems like the kind of stuff you want puppet to distribute and restart the services once that happens
[12:24:38] <_joe_> or am I missing something?
[12:24:50] _joe_: from the description, it sounds like you need a coordinated restart
[12:25:11] (once you've gotten to the state where the certs have expired)
[12:26:41] <_joe_> kormat: uhm you're correct
[12:27:08] _joe_: yes
[12:27:53] oh those expired?
[12:27:56] lemme fix that
[12:28:07] <_joe_> Not After : May 25 11:34:02 2020 GMT
[12:28:09] <_joe_> yes
[12:28:46] they expired while i was on lunch, which was nice of them :)
[13:13:08] akosiaris: thanks!
[13:15:34] kormat: I meant to code an alert for those but never got around to it unfortunately
[13:16:14] an event that happens once every 5 years is hard to prioritize, i was impressed that it was documented
[13:19:15] <_joe_> given the impact is limited, I think it's overall ok
[13:32:07] we used to have one of those with puppet
[13:32:38] a table in the db overflowed every 5 years or so
[13:34:00] https://wikitech.wikimedia.org/w/index.php?title=Puppet&oldid=472592#Puppet_failure_on_all_hosts_with_Error:_Could_not_retrieve_catalog_from_remote_server:_Error_400_on_SERVER:_Mysql::Error:_Out_of_range_value_for_column_'id'_at_row_1:_INSERT_INTO_%60fact_values%60_(%60updated_at%60,_%60host_id%60,_%60created_at%60,_%60fact_name_id%60,_%60value%60)_VALUES_...
[13:37:38] <_joe_> jynus: that was solved moving to postgres
[13:37:44] I know
[13:37:53] <_joe_> now that puppet uses a real database we don't have problems anymore
[13:38:16] untrue, if we wanted a real database we would have gone with Oracle
[13:40:29] haha
[13:41:49] _joe_: thx for the doxygen patch. Have you got it built on boron? If so it can be uploaded to buster-wikimedia component/ci ;)
[13:42:02] <_joe_> not on boron but yes, doing it
[13:42:20] <_joe_> hashar: component/ci ? cool
[13:42:29] somehow yeah ;)
[13:43:13] _joe_: did you run a compiler for gerrit/598463 ?
[13:46:59] <_joe_> ?
[13:47:12] the _roles/_role change
[13:47:18] <_joe_> no
[13:47:32] <_joe_> but it should really not change anything, I plan on running it anyways
[13:47:50] k
[13:47:50] <_joe_> we don't allow multiple roles since forever now
[13:47:54] yeah
[13:48:55] <_joe_> hashar: the package is uploaded to wikimedia-buster component/ci
[13:49:25] _joe_: confirmed. Thank you so much ;]]
[15:13:24] elukey: https://phabricator.wikimedia.org/T252027#6163226
[15:15:44] kormat: <3 greaaaattttt!
[15:16:26] so IIUC we'll have to add the script to partman's "collection" and then use it?
[15:17:01] something like that, yes. how exactly to deploy this best is something i'll have to discuss with people who do less insane things.
[15:17:28] i've been testing it by simply wget'ing it in partman/early_command
[15:18:26] elukey: i'll keep you updated when it's more production-ready
[15:23:33] super, thanks a lot
[15:47:27] kormat: <3
[15:47:51] for not being afraid of partman/d-i :)
[15:49:35] elukey: what is your use case, keeping /srv too, or something else?
[15:50:57] based on T234629#6158278 I think it is the same, correct?
[15:50:57] T234629: Move the Analytics infrastructure to Debian Buster - https://phabricator.wikimedia.org/T234629
[15:51:35] fyi, in 10min I'm going to move the ulsfo peering port, no impact expected
[15:52:27] jynus: yes exactly, and I think it is a use case that we all have with the new standardized partitioning scheme. For example, all kafka brokers in theory can survive /srv being obliterated (since they can ask other replicas) but it would be better not to if possible
[15:52:44] same thing for druid
[15:53:03] so having partman aware of /srv would speed up the road to buster a lot (in theory)
[15:53:29] oh, I was just asking in case it was other than lvm over /srv, maybe you had another mount point of large hadoop servers
[15:53:49] if it is /srv it will be much easier to share kormat's work
[15:54:44] also, elukey which filesystem format do you use for /srv, ext4?
[16:00:43] yep
[16:02:23] very nice re: being able to keep /srv across reimages if we need to!
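
Note on the 13:15:34 message about never getting around to an alert for the expired ganeti certificates: the check could look roughly like the sketch below. It parses a date in the same layout as the "Not After" value pasted at 12:28:07 and maps the remaining time to an Icinga-style status string. The function names and the 30-day warning threshold are assumptions made for illustration and are not taken from any existing Wikimedia check. `%b` month parsing also assumes an English/C locale.

```python
from datetime import datetime, timezone

# Layout of the "Not After" value as pasted in the log, e.g.
# "May 25 11:34:02 2020 GMT".
NOT_AFTER_FORMAT = "%b %d %H:%M:%S %Y %Z"

# Hypothetical threshold: start warning 30 days before expiry.
WARNING_DAYS = 30


def parse_not_after(text):
    """Turn a "Not After" string into a timezone-aware UTC datetime."""
    parsed = datetime.strptime(text.strip(), NOT_AFTER_FORMAT)
    # strptime returns a naive datetime; the field is always GMT.
    return parsed.replace(tzinfo=timezone.utc)


def expiry_status(not_after, now, warning_days=WARNING_DAYS):
    """Return "OK", "WARNING" or "CRITICAL" for a certificate expiry date."""
    remaining_days = (not_after - now).total_seconds() / 86400
    if remaining_days <= 0:
        return "CRITICAL"  # already expired
    if remaining_days <= warning_days:
        return "WARNING"   # expires soon
    return "OK"


if __name__ == "__main__":
    not_after = parse_not_after("May 25 11:34:02 2020 GMT")
    print(expiry_status(not_after, datetime.now(timezone.utc)))
```

Passing `now` in explicitly keeps the status function easy to test. A real check would still have to read the date from the certificate file on each node, which this sketch leaves out.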