[02:41:33] Domains, HTTPS, Traffic, DNS, Operations: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (Pmj) China and Turkey seem to be the only countries [[https://en.wikipedia.org/wiki/Censorship_of_Wikipedia|blocking Wikipedia]] at the moment,...
[03:23:32] Traffic, Operations, Discovery-Search (Current work), Patch-For-Review: nginx is failing to restart on cloudelastic100[1-2].wikimedia.org. Will also fail on cloudelastic100[3-4] when restart is attempted. - https://phabricator.wikimedia.org/T223734 (Krenair) @mathew.onipe, how are things on cloud...
[03:28:33] Traffic, Operations, Discovery-Search (Current work), Patch-For-Review: nginx is failing to restart on cloudelastic100[1-2].wikimedia.org. Will also fail on cloudelastic100[3-4] when restart is attempted. - https://phabricator.wikimedia.org/T223734 (Mathew.onipe) Things are Ok now. Thanks!
[03:28:54] Traffic, Operations, Discovery-Search (Current work), Patch-For-Review: nginx is failing to restart on cloudelastic100[1-2].wikimedia.org. Will also fail on cloudelastic100[3-4] when restart is attempted. - https://phabricator.wikimedia.org/T223734 (Mathew.onipe) Open→Resolved
[04:36:15] Domains, HTTPS, Traffic, DNS, Operations: Merge Wikipedia subdomains into one, to discourage censorship - https://phabricator.wikimedia.org/T215071 (TerraCodes) >>! In T215071#4959749, @Liuxinyu970226 wrote: > @Platonides >> However note that in the event that your browser can't support TLS v...
[04:38:06] Traffic, Operations, Discovery-Search (Current work), Patch-For-Review: nginx is failing to restart on cloudelastic100[1-2].wikimedia.org. Will also fail on cloudelastic100[3-4] when restart is attempted. - https://phabricator.wikimedia.org/T223734 (Krenair) a:Krenair
[06:45:22] Traffic, Operations, serviceops, HHVM, PHP 7.2 support: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (Joe)
[06:45:46] Traffic, Operations, serviceops, HHVM, PHP 7.2 support: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (Joe) p:Triage→High a:Joe
[06:47:52] Traffic, Operations, serviceops, HHVM, PHP 7.2 support: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (Joe) While there is no evidence that the increase in traffic sent to php7 is the cause of this increase in errors, the...
[09:00:30] Traffic, Operations, serviceops, HHVM, and 2 others: Increased instability in MediaWiki backends (according to load balancers) - https://phabricator.wikimedia.org/T223952 (Joe) While we have to wait and see if the absence of php7 traffic improves the situation (and in that case, why is that the c...
[15:15:13] wikibugs never made it back into here, hmmm
[15:23:32] bblack: did any task get updated?
[15:23:50] because I've restarted it and it made it back, but it connects only on the fly, the first time it has something to say
[15:24:00] *back into -operations
[15:30:12] yeah, maybe my last task update was before
[15:59:24] netops, Operations, cloud-services-team (Kanban): network hardware: drop nova-network related configuration - https://phabricator.wikimedia.org/T223925 (ayounsi)
[16:02:20] here we go, welcome back wikibugs
[16:13:13] setting lvs1013/1014 to "active" in Netbox
[16:41:11] XioNoX: what's up with cr1-codfw interface errors and now a minor alert on ripe probes there?
[16:41:34] 16:33 < librenms-wmf> Warning Alert for device cr1-codfw.wikimedia.org - Outbound interface errors
[16:41:51] then at :38 that cleared and an inbound one appeared
[16:41:54] now ripe
[16:42:11] (recovery)
[16:43:08] anyways, going to start poking at LVS in a few (disable puppet on all the eqiad LVSes, merge a puppet change, then carefully puppetize while stopping/starting pybal in various places to transition to the new primaries, etc)
[16:44:32] bblack: I think that's all related to the emails XioNoX has been exchanging with Telia
[16:44:50] interface errors are tracked in https://phabricator.wikimedia.org/T222967
[16:45:01] ok
[16:45:30] we got a drop of traffic from Telia, which probably caused the RIPE alert, so I'm wondering if they're looking into it
[16:45:49] see the small drop on the right: https://librenms.wikimedia.org/graphs/to=1558456800/id=8288/type=port_bits/from=1558435200/
[16:46:44] the big one is me draining that link for fiber cleanup, but the small one is probably Telia
[17:58:14] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (BBlack) **Current status of transition:** New hosts: lvs1013 is primary for high-traffic1 lvs1014 is primary for high-traffic2 lvs1015 is primary for low-traffic lvs1016...
[18:03:54] Traffic, Operations: LVS interface settings from /e/n/i not consistently applied on first boots - https://phabricator.wikimedia.org/T224027 (BBlack) FWIW, lvs1016 came back with correct settings after the single additional reboot above.
[18:06:54] XioNoX: can you switch the eqiad LVS static routes around?
[18:07:14] randomly :-P
[18:07:17] the static routes that were facing lvs1001 are now to lvs1013. lvs1002 becomes lvs1014.
[18:07:29] and the ones that are currently set to lvs1016 now become lvs1015
[18:09:02] bblack: on it!
[18:16:06] bblack: no AAAA btw for the LVS?
[18:18:23] XioNoX: I think we've been missing it in many cases for quite some time, last I checked
[18:18:40] yeah, the old LVS don't have them either
[18:18:51] we probably should fix that, but that's a separate issue :)
[18:19:10] (either way, traffic gets routed, it's just DNS convenience for the v6 side)
[18:19:36] yeah, I noticed as I have to update the v6 route as well, so I was looking for the IP
[18:19:47] but not an issue
[18:22:17] ah
[18:22:17] yeah
[18:22:18] bblack: can you send me the decom task for lvs1001/2 when you have it?
want to add a note to delete the "protect-old-lvs-servers" firewall filters so I don't forget
[18:22:18] sure
[18:22:18] I imagine it will either be Thursday or next week, depending on how the rest of this goes
[18:23:04] (that we start the first steps of decom; before that I don't want to make a ticket and possibly prompt bad things early)
[18:24:12] right now they're still viable backups if something goes horribly wrong on the first day these new ones are in service
[18:24:27] 1016 backs them all up too anyways, but just in case something affects all the new hardware/setup equally heh
[18:29:22] bblack, does that look good? https://www.irccloud.com/pastebin/N6gDlRN0/
[18:33:31] XioNoX: +1 - manually rechecked
[18:33:41] thanks!
[18:36:50] done
[18:37:49] ok, thanks
[18:37:56] carrying on with the risky business!
[18:42:37] Traffic, Operations, ops-eqiad: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (BBlack) It's been up for ~15 days now without incident, but depooled. Re-pooling it today to see if we can get a recurrence or not.
[18:46:16] Traffic, Operations, ops-eqiad: cp1083 crashed - https://phabricator.wikimedia.org/T222620 (BBlack) Nevermind, apparently it was already repooled, looking at the wrong thing here...
[18:48:57] so I got confused about 1083 because it was the only one with an apparent ticket
[18:49:03] but it's actually 1085 that's missing
[18:49:20] 1085's frontend services are depooled in etcd, but the backend is pooled, no ticket that I see
[18:50:12] heh, apparently since ~ Apr 16th I'm guessing from SAL
[18:50:30] there was a "depool" execution then later a backend restart, but nothing to repool the frontends
[18:50:54] !log repooling cp1085 frontends (weren't meant to be depooled)
[18:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:33] ah, sorry about that, Brandon
[20:30:20] we should probably add an icinga check like the one for puppet disabled for too long
[20:31:13] seems like a good idea for conftool instances in general
[20:31:30] yeah, that's what I meant
[20:32:11] would be nice for discovery objects too, but only for those configured as a/a
[20:32:28] do you know offhand the interval for the puppet alert?
[20:34:35] cdanis:
[20:34:35] $warninginterval = $puppet_interval * 60 * 6
[20:34:35] $criticalinterval = $warninginterval * 2
[20:35:12] these are in seconds
[20:35:50] $puppet_interval is 30, so 3h warning, 6h critical I'd say
[20:36:08] I almost think we'd want longer for this, but I don't know
[20:40:55] depends, we usually depool either for a couple of hours at most, for maintenance, deployments, quick hardware maintenance
[20:41:05] or for days for longer things, like broken disks
[20:41:47] so maybe warning after 12h and critical after 3 days? dunno, just saying random numbers almost :D
[20:42:12] it'd be easier to solve if there was a way to attach a task# to depooling itself
[20:42:36] but we can always say that if you plan to go long, you need to go find the (probably already-warning) icinga alert and silence it and phab-tag it
[20:43:28] can we auto-generate alerting for conftool stuff, based on the hostnames in conftool, attached as icinga services to the host?
[20:44:16] that would be nice
[20:44:31] e.g.
cp1085 could have an icinga service auto-defined which is "Service cache_text/varnish-fe is pooled"
[20:46:41] I'd bikeshed it as 3h warning, 24h crit
[20:47:26] (downtiming a host and all its services will silence it anyways by default, if it's attached to the host properly)
[20:49:14] what would be a viable mechanism for implementing that?
[20:54:54] if you want it to be attached to the host, it should be done via NRPE as host monitoring
[20:55:13] that's the easiest way to have it downtimed when downtiming a host
[20:57:26] and you could parse confctl --quiet select name=$(hostname -f) get
[20:57:30] one JSON per line
[20:58:05] ofc we need to decide what to do for half-pooled stuff
[20:58:24] and there might be an easier way to do it too :)
[21:00:30] unless we go the way of one NRPE check per pooled service, but that might create more noise than not
[21:11:25] and of course you can do the same via python using conftool as a library
[21:33:42] forgot to mention the fun part cdanis ... last modified time ;)
[21:34:06] I think it is probably very hard to do this without using conftool as a library, and possibly without modifying conftool
[21:34:19] I don't think we store it
[21:34:39] there are ofc created and modified indexes in etcd
[21:34:47] not very useful for this
[21:37:55] fwiw I kinda like the idea of also allowing a depool note/reason in conftool itself
[21:38:37] that's a long-standing discussion; bran.don's original request IIRC was to have a stack of them, and a host is pooled only if the stack is empty
[21:39:04] so I can depool 'maintenance task...' and you can depool 'deploying gerrit foo'
[21:39:14] oh, like the thing everyone pretends we have with disable-puppet, but don't actually have
[21:39:15] and we don't interfere with each other
[21:39:20] 🙃
[21:39:24] yep
[21:39:36] ok I really have to go now, ttyl
[21:39:38] as a partial workaround we'll have that feature in spicerack
[21:39:40] ttyl
[21:39:46] I'm going off too
[22:17:05] Traffic, Wikimedia-Apache-configuration, DNS, Matrix, Operations: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Tgr)
[22:25:56] Traffic, Wikimedia-Apache-configuration, DNS, Matrix, Operations: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Krenair) Does the Foundation have an NDA with modular.im?
[22:31:40] Traffic, Wikimedia-Apache-configuration, DNS, Matrix, Operations: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Dzahn) The comments on T215042#4977385 sounded like this wasn't going to be done, for the temporary evaluation that...
[23:45:03] Traffic, Wikimedia-Apache-configuration, DNS, Matrix, Operations: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (Tgr) >>! In T223835#5202604, @Krenair wrote: > Does the Foundation have an NDA with modular.im? NDA for what? This...
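A minimal sketch, under assumptions, of the check discussed above between 20:43 and 21:00: an NRPE-style plugin attached to the host that shells out to the confctl command quoted at 20:57 and alerts when a service stays depooled past the 3h/24h thresholds suggested at 20:46. Since conftool does not store a last-modified time (the point raised at 21:33-21:34), the sketch keeps its own per-service timestamp file. The confctl output layout (one JSON object per line, keyed by the FQDN, with a "pooled" field and a "tags" string), the state directory, and the treatment of half-pooled services are all assumptions here, not the real tool's documented behaviour.

```python
#!/usr/bin/env python3
"""Sketch of an NRPE-style check: warn/crit when a conftool service on this
host has been depooled for longer than the thresholds discussed in the log.
conftool does not store when the pooled state last changed, so this keeps
its own timestamp file per depooled service."""

import json
import os
import socket
import subprocess
import sys
import time

WARN_AFTER = 3 * 3600    # 3h warning, as suggested at 20:46
CRIT_AFTER = 24 * 3600   # 24h critical
STATE_DIR = "/var/tmp/conftool-depool-check"  # hypothetical state location


def get_services(fqdn):
    """Yield (tags, pooled) for each conftool object naming this host.
    Output format assumed: one JSON object per line, keyed by the FQDN,
    plus a "tags" string -- verify against the real confctl output."""
    out = subprocess.check_output(
        ["confctl", "--quiet", "select", "name={}".format(fqdn), "get"],
        text=True)
    for line in out.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        yield obj.get("tags", "unknown"), obj.get(fqdn, {}).get("pooled")


def main():
    os.makedirs(STATE_DIR, exist_ok=True)
    now = time.time()
    worst = 0  # Nagios exit codes: 0 OK, 1 WARNING, 2 CRITICAL
    notes = []
    for tags, pooled in get_services(socket.getfqdn()):
        state_file = os.path.join(
            STATE_DIR, tags.replace(",", "_").replace("/", "_"))
        if pooled == "yes":
            # Pooled (again): forget any previously recorded depool time.
            if os.path.exists(state_file):
                os.unlink(state_file)
            continue
        # Anything not "yes" (including half-pooled states, an open question
        # in the chat) is treated as depooled in this sketch.
        if not os.path.exists(state_file):
            with open(state_file, "w") as f:
                f.write(str(now))
        with open(state_file) as f:
            age = now - float(f.read())
        if age > CRIT_AFTER:
            worst = max(worst, 2)
        elif age > WARN_AFTER:
            worst = max(worst, 1)
        notes.append("{} depooled for {:.1f}h".format(tags, age / 3600))
    print("; ".join(notes) if notes else "OK: all services pooled")
    return worst


if __name__ == "__main__":
    sys.exit(main())
```

As mentioned at 21:11, the same thing could be done with conftool as a Python library instead of shelling out, which would avoid depending on the CLI's output format.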
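And a toy model of the depool-reasons idea from 21:38-21:39, where a host counts as pooled only while no depool reasons remain. It uses a set rather than a literal stack so that independent operators' reasons don't interfere with each other; this is an illustration of the idea, not anything conftool or spicerack actually implements.

```python
class DepoolReasons:
    """Toy model: each depool records a reason, each repool removes it,
    and the host is pooled only while no reasons remain."""

    def __init__(self):
        self._reasons = set()

    def depool(self, reason):
        self._reasons.add(reason)

    def repool(self, reason):
        self._reasons.discard(reason)

    @property
    def pooled(self):
        return not self._reasons


# Two operators depool independently; the host repools only after both finish.
host = DepoolReasons()
host.depool("maintenance task...")   # the example reasons from the chat
host.depool("deploying gerrit foo")
host.repool("maintenance task...")
assert not host.pooled               # still depooled: one reason remains
host.repool("deploying gerrit foo")
assert host.pooled
```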