[04:17:27] I am going to put phabricator in read only for a couple of minutes to restart the db primary master
[07:14:33] andrewbogott: sure I'm fine to wait for base-files! thanks for looking into it
[07:58:27] godog: o/ - jayme and I are working on swapping the zookeeper nodes, and we should roll restart kafka in codfw to pick up changes
[07:58:38] ok to roll restart logging-codfw?
[07:58:55] the cookbook supports it (also checked the cumin labels)
[07:59:20] elukey: yeah I think so, thanks for the heads up, I don't think the current lag alert is anything critical
[08:00:19] perfect :)
[08:09:14] I will be afk for a few hours this morning, need to go to Malaga and pick up my residency card, officially making me an EU citizen again :)
[08:09:30] jbond42: welcome
[08:09:43] :) thanks
[08:13:26] good luck!
[08:17:55] congratulations :-)
[09:12:37] welcome back John!
[09:35:00] * ema sings the 9th Symphony
[10:16:43] thanks all :)
[10:17:08] welcome back to the EU!
[10:17:28] indeed, all sorted now until 2026
[10:18:03] well done jbond42!
[13:18:44] godog: will roll restart kafka-logging-codfw again to have it pick up the last of the swapped zookeeper nodes
[13:19:10] jayme: ack, thanks for the heads up!
[13:19:23] elukey: same for kafka-main-codfw :)
[13:21:40] jayme: +1 nice, I can take care of them if you are busy
[13:21:51] elukey: it
[13:22:08] it's fine. I've ~40min till the next interview :)
[13:22:37] jayme: as you prefer, I am free now if you need time to prep for the interview
[13:25:22] thanks. Will do so now while the restarts are running. Maybe you can just take a look in an hour in case I missed something
[13:25:55] and roll the mirror-maker if everything is fine
[13:29:51] +1
[13:29:56] really nice job btw!
[13:36:43] :) thanks for your support
[13:42:56] I'll be draining and rebooting sessionstore hosts over the next hour or so
[13:45:29] cdanis: we've got a kafka lag alert for logging-codfw on consumer group cdanis-kafkacat, known/expected?
[13:45:37] https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-codfw&var-topic=All&var-consumer_group=All that is
[13:46:13] ah interesting, I guess that kafkacat didn't terminate cleanly?
[13:46:23] it's not of any concern fwiw
[13:49:10] cdanis: ack, thanks! is there anything actionable atm to clear the alert and/or investigate?
[13:52:33] I'll take a look
[13:55:04] What actually happened this morning? The IR mentions 500 errors, but did commons go fully down or just RO?
[13:55:59] RhinosF1: the primary database was wedged, aiui
[13:56:32] godog: I believe that restarting kafkacat and then ctrl-c'ing it again fixed it
[13:56:46] cdanis: lol, indeed! thank you for taking a look
[13:57:12] RhinosF1, more on the report coming soon, but for now you can read what I added to: https://commons.wikimedia.org/wiki/Commons:Village_pump#Ongoing_outage_on_commons
[13:57:47] Ty jynus
[14:25:59] marostegui: o/
[14:26:24] we are going to ask for 2 mysql servers to separate out the existing dbs on an-coord100[12] in the next FY
[14:26:39] what should we call them in the capex sheet?
[14:26:43] I know we can always adjust later
[14:26:45] but
[14:26:54] an-db1001?
[14:27:03] or should we stick with the regular db naming convention?
[14:27:06] db10xx?
[14:27:09] ottomata: I would assume that goes for your team's budget, but I have no idea about the process, maybe sobanski can shed some light on that
[14:27:17] right right, it's on our budget
[14:27:27] just asking if you all have a preference for what they'd be named
[14:27:29] ottomata: if it is going to be to split the an-coord stuff, I would suggest using your own naming convention
[14:27:34] for reference on the sheet
[14:27:34] as those are owned by you at the moment
[14:27:40] ok cool
[14:27:46] luca said I should ask you
[14:27:56] I'll call them an-db100[12] (at least for now)
[14:28:07] ty
[14:28:27] ottomata: If they will be owned by your team (like the current an-coord ones) I am fine with an-dbXXXX indeed
[14:28:31] k cool
[14:28:35] thanks :*
[14:37:22] marostegui: the idea was to have something a bit more standardized like db1108
[14:37:59] but of course with Analytics doing most of the maintenance, caring about the replication etc.
[14:38:21] ah, it will be like db1108?
[14:38:22] but following data persistence's guidelines (kernel, puppet config, mariadb version)
[14:39:00] it should become an active/standby setup, maybe with dbproxies (on an-coord100x?) or not, but with multi-instance + replication
[14:39:30] my fear is that if we keep our own hosts then we'll end up following different best practices etc. (as we do for an-coord100x)
[14:39:39] if it is going to follow that, then I'm fine with db1xxx
[14:39:49] ottomata: --^
[14:39:51] I thought it was going to be a customized installation
[14:40:11] nono it should follow what we use in production as much as possible
[14:40:13] can we discuss this tomorrow? I'm exhausted, started at 6am today
[14:40:21] marostegui: of course :)
[14:40:32] elukey: then I'm fine with db1xxx
[14:50:22] ok db1xxx it is
[14:53:34] if it helps the reasoning about names
[14:53:55] the reason we name most things db* is because they are technically interchangeable
[14:54:03] if they have the same spec, it is not a bad thing
[14:54:31] e.g. so that in the future we can lend each other hardware more easily than having to re-label
[14:55:48] for example, db* hosts for backups are a different role we traditionally call "dbstores"
[14:56:05] but we call them db*s because we could pool them into production in an emergency
[15:25:41] Ah, the good old days, when backups were taken, 51 years ago :-): https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=people1003.eqiad.wmnet-Monthly-1st-Sun-production-home&from=1619492511586&to=1619508775373&viewPanel=4
[20:15:47] jynus: interesting bug. took me a while to reproduce in isolation but got it: https://w.wiki/3FLT
[20:16:16] It looks like in a handful of cases in the last 7 days, it genuinely pulled an int(0) for that metric
[20:16:34] most of the rows returned, however, have null / nothing there
[20:17:16] but when there is a last non-null value within range, it uses it, and as a timestamp that's 1970/51y ago, so probably not a Grafana bug
[20:20:21] better link: https://w.wiki/3FLX
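
A note on the consumer-lag alert discussed around 13:45-13:56: consumer lag is computed per partition as the broker's log-end offset minus the group's last committed offset, so a consumer group left behind by a kafkacat session that did not shut down cleanly keeps reporting growing lag even though nothing is consuming. Below is a minimal sketch of that calculation, assuming the kafka-python client and a placeholder broker address; this is not the tooling behind the alert, just an illustration of the arithmetic.

    # Sketch only: per-partition lag for a consumer group, computed as
    # (log-end offset - committed offset). Broker address is a placeholder.
    from kafka import KafkaAdminClient, KafkaConsumer

    BOOTSTRAP = "localhost:9092"   # placeholder, not a real cluster address
    GROUP = "cdanis-kafkacat"      # the group named in the alert above

    admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
    committed = admin.list_consumer_group_offsets(GROUP)    # {TopicPartition: OffsetAndMetadata}

    consumer = KafkaConsumer(bootstrap_servers=BOOTSTRAP)
    log_end = consumer.end_offsets(list(committed.keys()))  # {TopicPartition: int}

    for tp, meta in sorted(committed.items()):
        lag = log_end[tp] - meta.offset
        print(f"{tp.topic}[{tp.partition}] committed={meta.offset} end={log_end[tp]} lag={lag}")

In the log above, re-running kafkacat and interrupting it cleanly was enough to clear the alert; for a group that is truly dead, deleting its committed offsets (e.g. with kafka-consumer-groups.sh --delete --group) would also make the lag metric go away.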
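
On the "51 years ago" timestamps (15:25 and the bug discussion from 20:15 onward): a sample of int(0), or a zero carried forward as the last non-null value within the query range, is the Unix epoch when interpreted as a timestamp, so it renders as 1970-01-01, roughly 51 years before this log. A worked example of just that arithmetic, in Python and purely illustrative (not the actual Grafana query):

    # Illustrative only: why a zero-valued sample renders as "51 years ago".
    from datetime import datetime, timezone

    sample = 0  # the metric genuinely returned int(0) in a handful of cases
    rendered = datetime.fromtimestamp(sample, tz=timezone.utc)
    # Roughly when this log dates from, taken from the Grafana link's from= timestamp (ms).
    log_time = datetime.fromtimestamp(1619492511586 / 1000, tz=timezone.utc)

    print(rendered.isoformat())               # 1970-01-01T00:00:00+00:00
    print((log_time - rendered).days // 365)  # 51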