[07:46:31] jynus: good morning, Jaime! ok to go ahead with https://gerrit.wikimedia.org/r/#/c/233671/ ?
[07:46:51] let me open my monitoring, and let's do it
[07:47:25] ok, didn't notice you wrote that five minutes ago :-)
[07:47:49] oh, didn't want to bother you, as it is not time-critical
[07:48:03] would have pinged you before office hours
[07:49:21] moritzm, ready
[07:49:32] do I merge or do you?
[07:49:38] I'll merge
[07:49:54] let's log it first
[07:50:29] technically there is probably going to be a disruption, even if only for some seconds
[07:50:52] thanks
[07:50:53] sure, done
[07:51:00] making a puppet run on 1043 now
[07:51:56] and we are back
[07:52:33] rules are up, nothing dropped so far
[07:52:40] hardly disrupting
[07:52:54] phabricator search working for me as well
[07:53:13] low traffic apps only see a spike of latency
[07:53:19] so not an issue
[07:53:38] are the codfw phab dbs in production?
[07:53:47] only thing that it could be is some long term things, like the bot for statistics or dumps
[07:54:03] but that should be inside the internal network in any case
[07:54:06] otherwise I'd just have them have their cronned puppet runs
[07:54:36] yeah, I'll leave the logging enabled on 1043 and check up on it for a while
[07:54:39] I think there is only 1 node, in a passive way (replicating, but not receiving user traffic)
[07:54:59] on labsdb100[1-3] nothing has been dropped yesterday or today, so we should be good there as well
[07:55:30] yes
[07:55:59] as I said, my main concern right now would be the heavy-connection databases and the masters for the wikis
[07:57:13] true, these will be a bigger challenge, shall we go ahead with db1069/sanitarium as well?
[07:57:29] yes
[07:57:38] ok, preparing a patch
[07:57:52] those are important hosts but receive no app traffic
[07:58:14] so network-wise they are very easy to check, with a very specific role
[07:59:24] https://gerrit.wikimedia.org/r/235417 for db1069/sanitarium
[08:00:06] +1
[08:01:43] merging
[08:01:57] and logged
[08:03:30] making a puppet run
[08:06:04] lost contact from the monitoring
[08:06:19] oh, here it is again
[08:06:24] it took more time than usual
[08:06:34] probably because it is more loaded and has more ports open
[08:08:39] monitoring is back to normal, replication is working
[08:08:43] one dropped connection from iron, was that you?
[08:09:02] yep
[08:09:10] cannot connect from iron, true
[08:09:36] so aside from the six connection attempts by you all looks good now
[08:09:50] yes, let's add the iron rule
[08:10:05] ok, let me prepare a patch
[08:10:15] not a big issue
[08:10:31] the iron thing is for us to administrate from a central place
[08:10:47] sure, you need that for other mariadb classes as well or specifically for sanitarium?
[08:11:22] we generally run mysql queries from iron to all db, es, pc, etc. hosts
[08:12:09] terbium and tin access is not needed for sanitarium
[08:13:38] for example, I was running watch "mysql -e \"SHOW PROCESSLIST\"" remotely
[08:13:47] ok, will prepare a patch
[08:14:19] I think we added it to the main class, I forgot on other less important classes
[08:14:26] so my fault
[08:15:34] the rule to access from iron is already in the common ferm mariadb::ferm class (but limited to the standard 3306/3307), we only need to amend the additional ports sanitarium uses
[08:19:11] ah
[08:19:13] of course
[08:19:18] :-)
[08:19:38] jynus: https://gerrit.wikimedia.org/r/235419
[08:20:01] in the future, I would like to get rid of the 7 instances
[08:20:08] and with it, the ports
[08:20:14] but that is not going to be soon
[08:21:00] +1
[08:21:17] sounds good, I'll merge the patch now
[08:23:24] I can access now
[08:24:09] great, also no further dropped traffic, I'll keep an eye on it for the next days, but I think we're all good here
[08:25:29] for sanitarium do not worry
[14:30:28] jynus: I checked the ports for role::mariadb::analytics, do you think it needs additional accessors outside of $INTERNAL? I can run tcpdump on it for a bit to check for current clients
[14:56:10] if you can, yes, but I was merely -1ing to leave them for later, because they are less straightforward, and I actually have a meeting with those users on Tuesday
[14:56:35] I do not want you to waste time if I can solve it myself
[14:58:58] sure, I'll do that tomorrow to gather some data
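
Note on the tcpdump check proposed at 14:30:28: one possible way to turn a capture into a list of current clients is sketched below. This is only an illustration, not something that was run in this session. It assumes the capture was taken as text with `tcpdump -nn` (numeric IPv4 addresses and ports), that the ports of interest are the standard 3306/3307 mentioned earlier (the analytics role may use others), and the function name `count_clients` and the sample addresses are made up for the example.

```python
import re
from collections import Counter

# Address part of a `tcpdump -nn` IPv4 line, e.g.
# "IP 192.0.2.10.51234 > 192.0.2.50.3306:"
# Groups: source address, source port, destination address, destination port.
LINE_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def count_clients(lines, ports=(3306, 3307)):
    """Count packets per source address sent to one of the given destination ports.

    `ports` defaults to the standard MariaDB ports named in the log; this
    default is an assumption and should be adjusted to the role being checked.
    """
    clients = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and int(match.group(4)) in ports:
            clients[match.group(1)] += 1
    return clients


if __name__ == "__main__":
    # Hypothetical sample lines using documentation addresses (192.0.2.0/24).
    sample = [
        "08:08:43.000001 IP 192.0.2.10.51234 > 192.0.2.50.3306: Flags [S], length 0",
        "08:08:44.000001 IP 192.0.2.11.40000 > 192.0.2.50.22: Flags [S], length 0",
    ]
    for address, packets in count_clients(sample).most_common():
        print(address, packets)
```

Sources that show up in the result but fall outside $INTERNAL would be the candidates for additional accessors; replies from the database host itself are ignored because their destination port is the client's ephemeral port.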