[09:13:10] hey _joe_, can we help somehow to speed up updating nginx on conf* instances? (regarding T164456)
[09:13:11] T164456: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456
[10:00:32] volans: so the 503s in ulsfo you were mentioning might be due to traffic being routed away from eqord to codfw perhaps
[10:00:37] https://librenms.wikimedia.org/device/device=140/
[10:00:41] https://librenms.wikimedia.org/device/device=92/
[10:00:50] yeah that's what I thought
[10:01:15] from the router alarm too I guessed that the current ongoing maintenance has put down one link
[10:01:24] and we re-routed to another route
[10:02:32] xe-1/2/0: down -> Transport: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave];
[10:16:29] vgutierrez: https://gerrit.wikimedia.org/r/#/c/425782/
[10:22:35] nice catch
[10:25:18] hmmm
[10:25:33] according to aptitude on jessie it's working because texinfo is installed
[10:25:43] so.. pure luck
[10:25:53] :)
[10:28:28] BTW, funny fact, on jessie w / who reports FQDNs and in stretch reports IPs
[10:29:24] but w(1) manpage on stretch assumes that the default output should be FQDN
[10:29:34] ema: do you know if tlsproxy::localssl is meant to be used as a general purpose also for internal TLS stuff?
[10:30:41] vgutierrez: if you have a minute https://gerrit.wikimedia.org/r/#/c/425495/ ;)
[10:30:54] volans: we do use it for swift, though I'm not sure if that fits your idea of internal TLS stuff
[10:31:23] TIL: `w -i` shows the IP instead of FQDN in the from field
[10:31:30] eheheh, let's say that to be used by me I have to add 2 parameters to the define and a couple of IFs in the erb template
[10:32:05] what I'd like to know is if we have a policy to use it because it ensures that good TLS config is the same across the fleet
[10:32:17] or what is the best practice for internal-only TLS stuff
[10:32:53] on nginx (to be precise)
[10:33:33] ema: right, but why stretch shows IPs? :P
[10:33:45] volans: oh.. I didn't comment on that, my fault
[11:07:17] vgutierrez: procps seems to have changed significantly between jessie and stretch :)
[11:07:50] for example, on stretch `watch` now shows the hostname on top
[11:08:30] also compare `ldd /usr/bin/w.procps`
[11:08:56] libsystemd.so.0 /o\
[11:09:06] OMG
[11:09:26] we are going to see systemd on ring0 soon
[12:49:22] vgutierrez: so yeah you've nerdsniped me. The different behavior of w isn't due to procps, it's actually /var/log/wtmp having the bastion's FQDN on jessie LVSs and the IP on stretch
[12:50:07] yep.. I reached the same conclusion after checking the strace output of w
[12:54:42] * vgutierrez is getting rid of jessie hosts one at a time...
[13:20:59] vgutierrez: the dns hosts are ripe as well. dns500x and dns400x are already stretch and the puppetization is already set up to work seamlessly for the other sites for stretch, too. They just need their distro switched and reinstalled.
[13:21:18] acamar.wikimedia.org,achernar.wikimedia.org,chromium.wikimedia.org,hydrogen.wikimedia.org,maerlant.wikimedia.org,nescio.wikimedia.org
[13:21:28] ^ 2x hosts in each of codfw, eqiad, esams
[13:22:11] some have hw replacements upcoming, though, so better wait for those to arrive (hydrogen, chromium)
[13:22:15] each reinstall is slightly-complicated by the need to manually depool from the LVS recdns service, and take care that local LVS's /etc/resolv.conf direct dependency on that host is edited out temporarily
[13:22:47] they all have replacement coming "soon", but the configuration works better on stretch anyways and we could wipe out the more-complicated jessie part of the puppetization.
[13:24:08] also, I dug into the lvs+stretch single-interface case on lvs5003, digging through all the optimization work and sysfs stuff. everything looks perfect.
[13:24:35] we might want to pause and take another careful look on the first multi-interface LVS we do just in case, but I bet it all works fine as well.
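Note on the 12:49 wtmp finding: a rough sketch of how the recorded login host could be inspected directly instead of going through `w`. It assumes the glibc `struct utmp` record layout used on x86_64 Linux (384 bytes per record); the names `login_hosts` and `host_kind` are made up for illustration and are not part of any tool mentioned in the log.

```python
import ipaddress
import struct

# Assumption: glibc x86_64 `struct utmp` layout, little-endian, 384 bytes:
# type, pid, line[32], id[4], user[32], host[256], exit (2 shorts),
# session, tv_sec, tv_usec, addr_v6[4], 20 unused bytes.
UTMP_FORMAT = "<h2xi32s4s32s256shhiii4i20x"
UTMP_SIZE = struct.calcsize(UTMP_FORMAT)
USER_PROCESS = 7  # ut_type value for a normal user login


def _text(raw: bytes) -> str:
    """Decode a NUL-padded fixed-width field."""
    return raw.split(b"\0", 1)[0].decode("utf-8", "replace")


def login_hosts(data: bytes):
    """Yield (user, line, host) for every USER_PROCESS record in data."""
    for offset in range(0, len(data) - UTMP_SIZE + 1, UTMP_SIZE):
        fields = struct.unpack_from(UTMP_FORMAT, data, offset)
        ut_type, _pid, line, _id, user, host = fields[:6]
        if ut_type == USER_PROCESS:
            yield _text(user), _text(line), _text(host)


def host_kind(host: str) -> str:
    """Return 'ip' if host parses as an IPv4/IPv6 address, else 'hostname'."""
    try:
        ipaddress.ip_address(host)
        return "ip"
    except ValueError:
        return "hostname"
```

Feeding it the bytes of `/var/log/wtmp` (opened in binary mode) and running `host_kind` on each host would show whether a given box stored FQDNs or IPs for the bastion logins.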
[13:25:00] nice :D
[13:26:44] bblack: lvs2006 looks a good candidate for multi-interface LVS testing on stretch
[13:26:49] *as a
[13:29:02] yeah sounds good
[13:29:44] lvs1001-6, I'm hoping we can just leave those as jessie and let lvs1013-16 take over from them as stretch
[13:30:08] those hosts are getting so old now, there's probably real risks of harm from the churn of reinstalling them :/
[13:30:25] so you are afraid of rebooting them :P
[13:30:44] basically
[13:31:01] well, and the general churn of the reinstall on the storage, could induce an earlier failure
[13:31:44] B>* 208.80.154.224/32 [20/10] via 10.192.16.141, ens5, 00:08:36
[13:31:45] B>* 208.80.154.254/32 [20/50] via 10.192.16.139, ens5, 00:00:16
[13:31:51] quagga is back on pybal-test2002
[13:32:00] quagga 1.x is slightly different
[13:32:55] Traffic, Operations, Pybal, Patch-For-Review: Upgrade LVS servers to stretch - https://phabricator.wikimedia.org/T177961#4126818 (Vgutierrez)
[13:32:58] Traffic, Operations, Pybal, Patch-For-Review: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993#4126816 (Vgutierrez) Open>stalled
[13:33:30] Traffic, Operations, Pybal, Patch-For-Review: Upgrade pybal-test instances to stretch - https://phabricator.wikimedia.org/T190993#4090290 (Vgutierrez) Let's keep pybal-test2001 as jessie till we don't have any LVS on production running jessie
[13:39:53] Traffic, Operations, ops-esams: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607#4078006 (jcrespo) It probably crashed today at 2018-04-12 13:31:20, hardware logs should be checked.
[13:47:24] re: things to take care of when reinstalling recursors, we even wrote documentation for that! https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices)
[13:47:28] :)
[13:47:38] <3
[13:47:52] https://gerrit.wikimedia.org/r/#/c/425811/ --> interface naming gets tricky in lvs2006
[13:48:14] chances of getting it right at the first attempt: -3
[13:48:31] haha
[13:49:33] I'm saving current interface names + mac addresses to double check after reimaging
[13:49:35] easy to verify before/after mapping of rows/vlans via lldpcli though
[13:49:50] that as well :)
[13:50:30] probably the correct general approach going forward (e.g. brand new LVSes being installed with stretch) is get through the installer first, and then before the first puppetization jump into the fresh install with the new_install key and map out the interfaces with lldpcli
[13:50:48] (which means we can't really rely on the auto-imaging scripts when bringing up a fresh host like that)
[13:51:45] either that or we figure out a reliable physical mapping that works between puppetization + dcops
[13:52:24] but LVSes are low-count host types and replaced infrequently, so the manual check before first puppetization isn't awful, and is more reliable.
[13:55:51] hmmm LLDP port description is auto-generated?
[13:56:08] https://phabricator.wikimedia.org/P6986$36 like that one... lvs2006-eth1
[13:56:28] or we should ask XioNoX to rename the ports after upgrading to stretch?
[14:09:35] bblack, ema: may I proceed with lvs2006?
[14:09:44] vgutierrez: +1
[14:10:19] vgutierrez: the port descriptions are manual, and we probably should update them to match, but they're just for humans to stare at AFAIK, so it's not urgent in any technological sense.
[14:12:22] Unable to negotiate with UNKNOWN port 65535: no matching cipher found. Their offer: aes256-cbc,aes128-cbc,3des-cbc
[14:12:24] hmmm
[14:12:33] lvs2006 is old I guess
[14:12:42] (sshing the mgmt interface)
[14:13:12] vgutierrez: https://phabricator.wikimedia.org/T171041
[14:16:23] TL;DR - some servers' firmware is old, and new ssh can't connect to it by default, and "To connect one explicitly needs to pass -oKexAlgorithms=diffie-hellman-group14-sha1 (and in some cases also -oCiphers=aes256-cbc)"
[14:16:58] yup, -oCiphers did the trick here
[14:24:10] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4127030 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` lvs2006.codfw.wmnet ``` The log can be found in `/var/lo...
[14:43:54] lvs2006 is complaining since March 17th regarding the iLO Flash card
[14:45:19] Traffic, DC-Ops, Operations, ops-codfw: lvs2006 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T192082#4127086 (Vgutierrez)
[14:48:22] Traffic, Android-app-feature-Compilations, Operations, Reading-Infrastructure-Team-Backlog, and 2 others: Determine how to upload Zim files to Swift infrastructure - https://phabricator.wikimedia.org/T172123#4127110 (Mholloway) a:Mholloway>None
[15:08:36] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4127185 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['lvs2006.codfw.wmnet'] ``` and were **ALL** successful.
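Note on the 14:16 TL;DR about old management firmware: a minimal sketch of assembling the ssh invocation with the two options quoted there. The helper name `legacy_ssh_argv` and the target in the usage line are hypothetical; only the option strings come from the quoted text.

```python
def legacy_ssh_argv(target: str, need_cbc_cipher: bool = False) -> list:
    """Build an ssh argv for a host that only offers legacy algorithms.

    The key exchange override is always passed; the CBC cipher override is
    added only when asked for, mirroring the "in some cases" wording.
    """
    argv = ["ssh", "-oKexAlgorithms=diffie-hellman-group14-sha1"]
    if need_cbc_cipher:
        argv.append("-oCiphers=aes256-cbc")
    argv.append(target)
    return argv
```

Something like `subprocess.run(legacy_ssh_argv("root@mgmt-host.example", need_cbc_cipher=True))` would then open the session. Passing the options per invocation keeps the weaker algorithms out of the default client config.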
[15:42:20] Traffic, Operations, Pybal, Patch-For-Review: Reimage LVS servers as stretch - https://phabricator.wikimedia.org/T191897#4127256 (Vgutierrez)
[15:43:04] lvs2006 looking good <3
[15:44:37] lldpcli confirms the interface naming match, eth0 -> eno1, eth1 -> eno2, eth2 -> ens1f0, eth3 -> ens1f1
[15:45:11] easy hack
[15:45:11] root@lvs2006:~# dmesg |grep renamed
[15:45:11] [ 2.928395] bnx2x 0000:03:00.1 eno2: renamed from eth1
[15:45:11] [ 4.909760] bnx2x 0000:04:00.1 ens1f1: renamed from eth3
[15:45:14] [ 5.073630] bnx2x 0000:04:00.0 ens1f0: renamed from eth2
[15:45:16] [ 5.173853] bnx2x 0000:03:00.0 eno1: renamed from eth0
[15:49:44] probably the rest of lvs2 will be the same, it was all the same hardware order at the same time.
[15:54:52] yep.. I've just confirmed it with cumin
[15:55:04] we'll get the same interface naming on every lvs2*
[15:58:01] nice :)
[16:01:12] Traffic, Operations, Pybal: Rename lvs* LLDP port descriptions after upgrading to stretch - https://phabricator.wikimedia.org/T192087#4127315 (Vgutierrez) p:Triage>Low
[16:05:36] bblack: would you have a few minutes to finish reviewing my new qdqs-internal service?
[16:05:47] so icinga is all green in lvs2006, ipvsadm looks good, loopback ip addresses are tagged as expected.. I'm going to reenable bgp
[16:14:07] gehel: yeah
[16:14:38] bblack: thanks! as reminder: https://gerrit.wikimedia.org/r/#/c/424599/ and https://gerrit.wikimedia.org/r/#/c/424587/
[16:17:13] BGP looking good on pybal and crw-codfw side \o/
[16:17:22] I think it looks good, +1 on those. You might want to sync up with ema or vgutierrez on deployment (as new services require pybal restart on low-traffic LVS in codfw+eqiad)
[16:17:33] bblack: just to validate... to deploy the DNS change, I just need to merge and run authdns-update on ns0.w.o ?
[16:17:42] and I donno what else they may be doing at the same time (e.g. lvs2006 vg is looking at right now, just reinstalled)
[16:17:57] gehel: correct. and do that one before the puppet one.
[16:18:03] bblack: thanks a lot for the reviews!
[16:18:28] ema, vgutierrez: can we schedule a window to merge a new LVS service?
[16:18:44] it's getting late here to do it today, but maybe early next week?
[16:19:09] the pybal restarts in this case would be lvses 2006,2003,1006,1003 (the "low-traffic" ones in eqiad + codfw)
[16:19:29] I saw "low-traffic" because in fact they're higher-traffic than the two we call "high-traffic" :P
[16:19:33] s/saw/say/
[16:19:40] gehel: monday looks good?
[16:20:34] vgutierrez: which timezone are you in? What time would work for you?
[16:21:12] gehel: CEST/UTC+2 so before 18:00 my tz
[16:22:26] vgutierrez: same timezone here! That makes it easy... Let's say 4pm our time
[16:22:31] awesome
[16:23:08] thanks!
[16:26:34] LOL
[16:26:58] it looks like we were doing the same on gcalendar gehel :P
[16:50:04] vgutierrez: at least we now have all the necessary infos :)
[17:35:14] Traffic, Commons, MediaWiki-File-management, Multimedia, and 11 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#4127719 (aaron)
[21:44:05] netops, Operations: ulsfo<->eqord BGP down - https://phabricator.wikimedia.org/T192114#4128411 (ayounsi)
[22:36:52] netops, Operations: Enabling graceful-switchover causes core dumps on cr1-codfw - https://phabricator.wikimedia.org/T191371#4128548 (ayounsi) Juniper's reply: > During the cleanup process, ksyncd will check for public nexthops to make sure that there are no public next hops remaining. If ksyncd finds a...
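Note on the 15:45 `dmesg |grep renamed` hack: a small sketch of turning that kernel output into an old-name to new-name table, which could then be compared against interface names saved before a reimage. The function name `rename_map` is made up for illustration, and the regular expression only assumes the "NEW: renamed from OLD" wording seen in the log.

```python
import re

# Matches the tail of kernel lines such as
# "[ 2.928395] bnx2x 0000:03:00.1 eno2: renamed from eth1"
RENAME_RE = re.compile(r"(\S+): renamed from (\S+)")


def rename_map(dmesg_text: str) -> dict:
    """Return {old_name: new_name} for every rename line in dmesg_text."""
    return {old: new for new, old in RENAME_RE.findall(dmesg_text)}
```

Running it over the output of `dmesg` on each host and comparing the resulting dictionaries is one way to confirm that same-model machines end up with the same naming.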