[01:09:35] ema: FYI for tomorrow: I've swapped in (1-for-1) half the new eqiad nodes now, using all the ones in rows C and D (so 1083-90, with odd being text and even being upload, and depooled the highest-numbered 4 nodes from each cluster's legacy set). so text@eqiad is now 4 old + 4 new, and upload@eqiad is now 7 old + 4 new (it had 11 to start).
[01:10:16] ema: stopping there for today, will let that bake in case of unpredictable issues, maybe pick the process back up late tomorrow.
[01:11:36] ema: (also, I cleaned up the historical oddball nginx/varnish-fe weighting in etcd, where text@esams had weight=9 nodes and upload@eqiad had weight=4 nodes. they now all match the conftool-data defaults with global weight=1 for nginx/varnish-fe, and 100 for varnish-be (well, ignoring cache_misc, which is also weirdly different from past transitions, but is also leaving soon))
[06:54:56] ema: do you need anything on my side to switch things like puppetboard and debmonitor from misc to text?
[06:56:45] volans: yes! you could check if it works by modifying your /etc/hosts
[06:57:01] volans: if it does, update dns (see https://gerrit.wikimedia.org/r/#/c/operations/dns/+/450513/ for the equivalent commit for grafana)
[06:57:36] ema: ack
[06:57:39] wilco
[06:57:39] <3
[07:13:15] ema: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/450909/
[07:13:49] the third-last checkbox in T164609 shouldn't be marked as done?
[07:13:49] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609
[07:16:44] Traffic, Operations, Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609 (Joe) Sometimes we get 503 peaks from a `cache_misc` application like phabricator or gerrit; knowing the origin of the 5xxs in broad categories ("public traffic for the sit...
[07:17:24] volans: we haven't moved the IPs yet, changing DNS on a per-service basis seemed safer (and easier to rollback in case of trouble)
[07:18:07] right, E_TOO_EARLY, I saw the link to the CR and that was merged, didn't parse that was an old one :D
[07:18:43] volans: thanks for helping w/ puppetboard/debmonitor!
[07:19:04] ema: yw, it was sudo -i or -E for authdns-update?
[07:21:30] none of the above ofc, just sudo, but ema I see additional diffs
[07:21:47] I'm on authdns1001
[07:22:56] on authdns2001 it shows only the correct diff
[07:23:13] on authdns1001 it shows additional diffs from what seems to be previous unmerged commits, debugging
[07:24:24] ema: NAMESERVERS="radon.wikimedia.org authdns2001.wikimedia.org eeden.wikimedia.org"
[07:24:41] is that correct? as in is authdns1001 "depooled"?
[07:26:14] volans: ns0=radon, ns1=authdns2001, ns2=eeden
[07:26:20] so that looks correct
[07:26:36] ok, so authdns1001 cannot be used for merging, it's kinda confusing
[07:27:12] actually it might work from there, but authdns1001 is kept outdated
[07:28:55] https://github.com/wikimedia/puppet/blob/production/modules/role/manifests/authdns/data.pp#L6
[07:29:11] I'm merging from 2001 for now
[07:29:45] ty
[07:29:59] (yes, it is a bit confusing)
[07:30:36] it would be nice if the merge and sync part would be across all 4 also if one is not effectively serving traffic
[07:30:46] unless it's something temporary
[07:32:24] anyway, all good, sorry for the trouble ;)
[07:58:27] netops, Operations: Intermitent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (jcrespo) New issue: there seems to be connectivity issues between es1014 (B1) and prometheus1004 (B4), not intermitent, they are unable to ping . ``` root@es1014:/run/mysqld$ ping prometheu...
[07:59:38] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (jcrespo) See T201139#4483590, probably more relevant here (diconnection between a B1 and a B4 host).
[09:16:25] volans: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/450926 (not urgent)
[09:18:17] how much would you hate me if I'd suggest to add the enable/disable puppet bits to the library :)
[09:20:29] not much! :)
[09:38:30] Traffic, DNS, Operations, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` dns1001.wikimedia.org ``` The log can be found in `/v...
[09:38:37] Traffic, DNS, Operations, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dns1001.wikimedia.org'] ``` Of which those **FAILED**: ``` ['dns1001.wikimedia.org'] ```
[09:38:56] Traffic, DNS, Operations, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` dns1001.wikimedia.org ``` The log can be found in `/v...
[09:40:15] Traffic, DNS, Operations, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` dns1002.wikimedia.org ``` The log can be found in `/v...
[10:01:50] Traffic, Operations, Goal, Patch-For-Review: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 (Johan)
[10:01:53] Traffic, Operations, User-Johan, User-notice: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371 (Johan) Open>Resolved
[10:03:34] <3
[10:08:24] Traffic, DNS, Operations, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dns1001.wikimedia.org'] ``` and were **ALL** successful.
[10:09:45] Traffic, DNS, Operations, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dns1002.wikimedia.org'] ``` and were **ALL** successful.
[10:38:36] Traffic, DNS, Operations, Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (Vgutierrez)
[10:53:35] volans: the comment here seems to be lying, run_puppet does not return anything? https://github.com/wikimedia/puppet/blob/production/modules/profile/files/cumin/wmf_auto_reimage_lib.py#L558
[10:54:19] indeed, guilty as charged!
[10:54:30] sorry about that
[10:55:58] must have been lost in some refactoring
[12:06:40] authdns1001 is to replace radon, it just hasn't done so yet
[12:12:32] bblack: ack
[12:24:12] I can at least move it along a little :)
[12:34:28] bblack: I'm gonna let dns100[12] take some work load from chromium & hydrogen in eqiad
[12:35:17] ok
[12:42:07] Traffic, Core-Platform-Team, Operations, Performance-Team, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (mobrovac) p:Triage>Normal
[12:50:21] this time I'll talk to XioNoX before shutting down the old dns servers to avoid issues with the networking devices O:)
[12:51:42] authdns1001 is ready to replace radon, too, which also involves some arzheling
[12:51:56] (manually re-route the ns0 IP in router configs)
[12:52:07] I should reboot it to new kernel now before it's in service though
[12:52:21] !log rebooting authdns1001
[12:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:15] https://grafana.wikimedia.org/dashboard/db/dns-recursors?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-server=All&from=now-15m&to=now&refresh=30s
[12:56:21] dns10[12] looking good
[12:56:28] *dns100[12]
[12:56:59] let's depool chromium & hydrogen
[13:15:45] Traffic, Core-Platform-Team, Operations, Performance-Team, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Krinkle)
[13:16:00] Traffic, Core-Platform-Team, Operations, Performance-Team, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Krinkle) Added see also: {T193050}
[13:16:43] Traffic, netops, Operations: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414 (Vgutierrez) p:Triage>Normal
[14:14:09] bblack: are the cp servers on asw-c-eqiad not in use anymore?
[14:19:42] XioNoX: we're not at the point where we can decom anything yet
[14:19:58] (in case we have to roll back any of the pooling changes)
[14:20:18] but in good way :)
[14:22:01] there's currently a few on asw-c that are still live for traffic, but that should be cleaned up by tomorrow am
[14:22:24] then we'll wait a few days I think before really doing anything hardware level, just so we have a fallback plan at all
[14:23:17] yeah of course
[14:24:17] but once we reach that point, all eqiad cp servers (the 16x new, plus the 4x old we're keeping which is cp1071-4) will be on the new asw2-[abcd]-eqiad
[14:24:38] the only exception I see is cp1008 is still on asw-a-eqiad and not going away, although perhaps it should be replaced, I donno
[14:24:47] at the very least, we can move its port off of asw-a-eqiad at any time
[14:25:18] arguably we should maybe switch our cp1008/pinkunicorn stuff over to one of the ones being decommed
[14:25:51] (and then also move ports to new switches or whatever)
[14:28:09] or have cp1008 match the hardware we use in prod?
[14:34:31] Traffic, netops, Operations, IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099 (BBlack)
[14:43:12] well we don't like to rename hosts, it gets confusing on hardware, warranties, ticket history, etc
[14:43:36] so probably it means steal some other to-be-decommed one like "cp1065" and make it the new pinkunicorn
[14:47:44] XioNoX: at some point today, we need to shift around the authdns public IP routing to reboot/transition some authdns servers
[14:48:16] authdns1001 needs to replace radon for the default ns0 destination in eqiad, and authdns2001 (current ns1 destination) needs a reboot
[14:48:39] and eeden in esams (ns2 destination) needs a reboot too
[14:48:45] or arguably, a stretch upgrade
[14:48:58] bblack: sounds good
[14:49:13] I'm working on the switches right now but later today works
[14:49:16] ok
[14:53:39] ema, hardware issues on cp2002?
[14:55:36] vgutierrez: nope, anything wrong with it?
[14:56:39] uh...
[14:56:44] 16:51 < ema> [14:18:10] cp2002 just rebooted, looking
[14:56:56] my bouncer is trolling me? :)
[14:57:04] vgutierrez: haha, nope!
[14:57:40] it did have troubles after reboot because varnish got upgraded (apt full-upgrade)
[14:58:52] basically after a varnish upgrade we need to stop varnish and run puppet, which is something I forgot to do in this case!
[14:59:16] ack
[15:02:07] Traffic, Core-Platform-Team, Operations, Performance-Team, and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (Imarlier) Possibly related -- this should likely be implemented separately, but there's a slight chance that there's over...
[15:08:24] chromium / hydrogen are almost ready to be decomm'ed :D
[15:22:51] XioNoX: I've created T201414 cause IIRC we missed that while shutting down the old dns/ntp servers in codfw :)
[15:22:52] T201414: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414
[15:30:09] vgutierrez: thx I started to push the new config around
[15:30:15] will let you know when done
[15:30:24] thx <3
[16:15:23] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5011.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[16:26:57] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (jcrespo) I think es1014 issue gone away (according to grafana)?
[16:33:46] netops, Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (ayounsi) Still no good for me (at least between prometheus1004 and es1014). Provided all the requested info to Juniper and their answer so far is "bounce the port" which solved the...
[17:01:31] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5011.eqsin.wmnet'] ``` and were **ALL** successful.
[17:07:11] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp5012.eqsin.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808...
[17:49:21] Traffic, Operations, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp5012.eqsin.wmnet'] ``` Of which those **FAILED**: ``` ['cp5012.eqsin.wmnet'] ```
[17:49:49] ah, my clever workaround worked only partially :)
[17:54:08] netops, DC-Ops, Operations, cloud-services-team: Refresh switch ports descriptions for recently renamed cloud servers - https://phabricator.wikimedia.org/T201444 (RobH)
[18:42:58] Traffic, Operations, Goal, Patch-For-Review: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 (Jdforrester-WMF) Is this now Resolved?
[18:44:58] Traffic, Analytics, Operations: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967 (Jdforrester-WMF)
[18:45:04] Traffic, Operations, Browser-Support-Internet-Explorer, Patch-For-Review, User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199 (Jdforrester-WMF)
[18:45:12] Traffic, Analytics, Operations: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967 (Jdforrester-WMF)
[18:45:16] Traffic, Operations, Browser-Support-Internet-Explorer, Patch-For-Review, User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199 (Jdforrester-WMF)
[18:45:48] Traffic, Analytics, Operations: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967 (Jdforrester-WMF) >>! In T147967#2710596, @BBlack wrote: > I'd suggest blocking this on the seemingly-unrelated T1471...
[19:33:38] netops, Operations, Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (Dzahn)
[19:34:21] netops, Operations, Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (Dzahn)
[19:37:12] netops, Operations, Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (Dzahn) Subtask to setup backups is now resolved. Incl. testing restore of files from Bacula console back to both netmon servers and dropping the psql database for netb...
[20:06:51] Traffic, netops, Operations, Patch-For-Review: eqiad row D switch upgrade - https://phabricator.wikimedia.org/T172459 (Cmjohnson)
[21:19:10] Traffic, Operations, ops-eqiad, Patch-For-Review: cp1080 uncorrectable DIMM error slot A5 - https://phabricator.wikimedia.org/T201174 (Cmjohnson) Created a self dispatch with Dell for a new DIMM. You have successfully submitted request SR977877163.