[07:26:36] moritzm: thanks for fixing my brainfart in admin/data :)
[07:29:02] <_joe_> kormat: if only we had type checks on the content of admin.yaml :P
[07:29:11] <_joe_> and some spec tests for the admin module
[07:29:31] how soon will we have puppet rules to write puppet data? :)
[07:31:22] <_joe_> kormat: the type system in puppet is basically a data validation system :)
[07:32:22] <_joe_> with some added awesomeness because implicit casting between types is sometimes allowed, sometimes not
[07:33:25] <_joe_> as usual, brandon's first law of puppet holds https://bash.toolforge.org/quip/AVfTAUmefIH_7EDsriqu
[07:33:51] oh god :)
[09:17:51] <_joe_> rzl: I love httpbb every day more
[09:18:16] <_joe_> everyone should know and use it, I'm thinking it's useful even from my computer for testing externally.
[13:52:02] you gotta appreciate the RAID utility literally saying "Status: �"
[13:54:25] _joe_: \o/
[13:56:50] <_joe_> rzl: oh, that means you will need to make a proper debian package 😱
[13:57:01] sorry, you broke up for a minute, I couldn't quite hear you
[13:57:11] kormat: btw puppet rules to write puppet data have been proposed before, see also https://wikitech.wikimedia.org/wiki/Cergen#Future_work
[13:57:34] <_joe_> oh the Cergen topic
[13:57:52] hah :)
[13:57:55] <_joe_> I used cfssl for a docker compose thing to test envoy this week
[13:58:01] rzl, if you need help and guidance for making a Debian package, there are a bunch of people capable of doing that, so do ask
[13:58:04] <_joe_> it's so much better than cergen :P
[13:58:13] liw: thanks!
[14:01:53] kormat: wow you've used dbctl already? don't look too closely at that sausage
[14:03:06] don't worry, he won't notice because he will be staring at the code using dbctl, which is much worse!
[14:03:24] :-D
[14:03:30] 0:-)
[14:21:20] cdanis: fortunately i live somewhere that has a deep appreciation for sausage
[14:23:27] kormat: okay well then I hope this is tasty https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/77eccc165e73dd2d137ca91f408ed295c2d87d3f/wmf-config/etcd.php#45
[14:24:06] it's written in php? 😮
[14:25:11] no, dbctl (and conftool) are in python, in another repo
[14:25:31] this is the 'config' code that glues its output into the internal datastructures used by mediawiki
[14:27:47] dbctl code is here, if you are morbidly curious https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/master/conftool/extensions/dbconfig
[14:28:37] ahh
[14:36:02] in my mind, in the future dbctl should orchestrate a proxy service, not php code, but that's how it works now :-D
[14:40:07] yeah, that seems better :)
[14:40:49] which is exactly how conftool/pybal works, but those have the privilege of being on a top layer :-D
[14:41:04] dbs being on a lower one
[14:43:06] cdanis: it is all planned on T119626, 3 or 4 days of work at most :-)
[14:43:06] T119626: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626
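For a rough sense of what that glue layer does, here is a minimal Python sketch of turning dbctl-style section data into a MediaWiki-style per-section load mapping. The structure, hostnames, and weights are simplified placeholders, not the actual schema dbctl writes to etcd nor the real shape of the MediaWiki configuration; the linked wmf-config/etcd.php does the equivalent in PHP.

```python
import json

# Hypothetical dbctl-style section data (placeholder hosts and weights; the
# real structure stored in etcd by dbctl/conftool differs).
DBCTL_BLOB = """
{
  "s1": {"master": "db-master-s1", "replicas": {"db-replica-1": 200, "db-replica-2": 300}},
  "s8": {"master": "db-master-s8", "replicas": {"db-replica-3": 400}}
}
"""

def to_section_loads(blob: str) -> dict:
    """Glue step: build a sectionLoads-like mapping per section, listing the
    master first with zero read weight (a common setup where the master takes
    writes but little or no replica read traffic)."""
    sections = json.loads(blob)
    return {
        name: {cfg["master"]: 0, **cfg["replicas"]}
        for name, cfg in sections.items()
    }

if __name__ == "__main__":
    print(json.dumps(to_section_loads(DBCTL_BLOB), indent=2))
```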
[15:36:22] o/
[16:44:28] elukey: do I recall correctly that the thing blocking upgrading memcache hosts past jessie was having either the gutter pool live or doing the DC switchover?
[16:44:52] yeah, I think that's correct
[16:44:56] like, we can reimage memcache hosts in the secondary datacenter, but not in the primary, unless we have gutter pool?
[16:45:41] cdanis: here's the perfect solution, failover memcache to codfw while keeping mw on eqiad :-P
[16:45:44] * volans hides
[16:46:00] cmon volans that'd never work
[16:46:05] yeah, but the reimage approach is a bit of a gamble if one only learns during the DC failover whether the new setup works fine (given new OS and new memcached release)
[16:46:10] what we have to do is move the mc2* hosts into eqiad, for latency reasons
[16:46:23] lol
[16:46:34] moritzm: I mean, not having a testbed environment, or additional hardware, are each their own problem, yes
[16:47:06] yeah, totally
[16:50:22] what was the status of gutter pool testing? all I remember was that ef.fie was working on it, some initial results, but ofc she isn't now
[16:51:36] most of the way there, and handed over to elukey to finish up I believe, not sure of the latest
[16:52:28] (of course that was the beginning of march, in the Before Times)
[17:28:22] <_joe_> cdanis: no that's not a very accurate characterization of the problem
[17:28:33] <_joe_> of either of them actually
[17:28:41] <_joe_> so.
[17:29:06] <_joe_> Upgrading memcached was postponed waiting for redis to be dismissed there, as we really wanted not to have to manage a transition to redis 5
[17:29:39] ohh, that's right
[17:29:57] <_joe_> at the same time, having the gutter pool on buster allows us a relatively low-risk way of getting accustomed to tuning a new memcached version, with different tunables and slab allocation algorithms, to our reality
[17:30:06] <_joe_> we've had to do that in the past too
[17:30:44] <_joe_> now, that can be easier using mcrouter, which allows you to shadow a % of the traffic to a secondary pool
[17:30:59] <_joe_> we might want to do that, but that's the kind of testbed you want
[17:31:48] <_joe_> as for the gutter pool, I honestly have no idea what the current situation is. There is a puppet patch, some test results reported on tasks, but I'd have to go and read through them to get a clearer picture of where we are
[17:32:15] <_joe_> I'm not a huge fan of that puppet patch, and I agree with most of the observations made by aaron in CR
[17:32:45] <_joe_> so I think next week or the one after that, I'll pick up that work, redo testing for the parts where I'm not sure what the status is, and deploy to production
[17:40:47] sorry, dumb question, what does "waiting for redis to be dismissed there" mean?
[17:41:34] "that'll do redis, that'll do"
[17:43:17] <_joe_> cdanis: finish the migration to sessionstore, then reassess if there is any remnant use of it
[17:43:23] ack
[17:43:43] <_joe_> we've paused as we deemed it a bit dangerous for this period
[17:44:05] <_joe_> but if the situation persists, we might decide to take the risk
[17:44:17] <_joe_> we == serviceops and core platform
[17:45:08] makes sense
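For reference on that shadowing idea: mcrouter can mirror a slice of the key space from the main pool onto a second pool. The sketch below (a Python dict dumped as JSON) is only an illustration of that kind of setup, with made-up server addresses, pool names, and fractions; it is not the production mcrouter configuration, and the exact schema should be checked against mcrouter's shadowing documentation.

```python
import json

# Rough sketch of an mcrouter route that shadows ~10% of the key space from
# the main pool onto a test pool (all values here are placeholders).
shadow_config = {
    "pools": {
        "main": {"servers": ["10.0.0.1:11211", "10.0.0.2:11211"]},
        "test": {"servers": ["10.0.0.9:11211"]},
    },
    "route": {
        "type": "PoolRoute",
        "pool": "main",
        "shadows": [
            {
                "index_range": [0, 1],             # shadow both main-pool hosts
                "key_fraction_range": [0.0, 0.1],  # ...but only ~10% of keys
                "target": {"type": "PoolRoute", "pool": "test"},
            }
        ],
    },
}

print(json.dumps(shadow_config, indent=2))
```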
[18:17:36] elukey: _joe_: https://w.wiki/MvZ I literally just deployed this, lol
[18:18:04] I think we'll find that our memcache hosts have micro-bursts all the time
[18:20:21] nice graph!
[18:21:01] so long as they're very micro, in theory the upper-layer protocols can smooth that out and handle it without too much hiccup
[18:21:10] but yeah they're not a great sign
[18:21:37] bblack: well, AIUI mediawiki has rather short RPC timeouts to memcached
[18:22:47] they're all 1G except for the gutter set?
[18:22:55] that link didn't give me a graph, only the list of dashboards
[18:23:06] apergos: log in
[18:23:12] hm am I not?
woops
[18:23:31] maybe i should file a FR upstream for a non-logged-in link to /explore to give an error message :)
[18:23:52] bblack: that's right, and the gutter set isn't being used anywhere real yet (see just above)
[19:08:54] cdanis: with the addition of the new host in wikidata today we're doing a lot better now
[19:08:59] considering that db1092 is still depooled
[19:09:04] I was just checking the graphs
[19:09:06] marostegui: yeah I had meant to check
[19:09:12] but was enjoying not seeing alerts :)
[19:09:40] cdanis: still quite nice to see 10.4 (db1114 and db1111) performing better than 10.1 (db1126) even with more weight
[19:09:47] also good news :)
[19:25:02] +1
[19:49:35] awesome!
[19:50:04] preliminary results say we have several memcached microbursts an hour
[19:56:05] interesting
[19:56:16] (much more interesting once logged in :-P)
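Back-of-the-envelope on why microbursts matter on a 1G memcached NIC: a burst that briefly exceeds line rate has to queue, and the time to drain that queue eats into whatever RPC timeout the client uses. The numbers in this sketch are illustrative only, not measurements from the dashboards above, and the timeout value is a hypothetical stand-in.

```python
# Rough arithmetic on a microburst hitting a 1GbE memcached host.
# All values are illustrative, not measured.
LINK_BPS = 1e9            # 1 Gbit/s NIC
BURST_EXCESS_BYTES = 2e6  # e.g. 2 MB arriving faster than the link can drain
TIMEOUT_S = 0.25          # hypothetical short client-side memcached timeout

drain_s = BURST_EXCESS_BYTES * 8 / LINK_BPS
print(f"queue drain time: {drain_s * 1000:.0f} ms")                               # 16 ms
print(f"share of a {TIMEOUT_S * 1000:.0f} ms timeout: {drain_s / TIMEOUT_S:.0%}")  # 6%

# A few ms of queueing gets absorbed by TCP and client retries ("upper-layer
# protocols can smooth that out"); bursts long enough to approach the timeout,
# or deep enough to overflow buffers and cause drops, become client-visible errors.
```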