[00:44:44] Traffic, Operations, ops-esams, Patch-For-Review: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (Dzahn) - home dirs copied to individual $user_bast1002.tar.gz files in each user home (where the user exists on both old and new server) so users have their old files if they...
[00:50:44] the bastion should theoretically work for installs now. i copied the installer data (and the home dirs, and prometheus) data
[00:51:00] i say theoretically because i did not merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/545973
[00:51:21] i think last time i had to ask for ACL changes in network to make it work as DHCP/tftp
[00:51:55] and i did not want to just break it and move on.. will merge tomorrow when we can test and people are online
[00:52:13] but if you want to merge that and try it out..please do
[02:08:10] Traffic, DC-Ops, Operations, decommission: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (BBlack)
[02:08:11] Traffic, DC-Ops, Operations, decommission: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (BBlack)
[02:11:30] Traffic, DC-Ops, Operations, decommission: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `lvs[3001-3004].esams.wmnet` - lvs3001.esams.wmnet (**PASS**) - Downtimed host on Icing...
[02:43:41] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (BBlack)
[02:43:54] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (BBlack) a:Papaul
[03:04:38] Traffic, DC-Ops, Operations, decommission: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (BBlack)
[03:06:24] Traffic, DC-Ops, Operations, decommission: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `maerlant.wikimedia.org,nescio.wikimedia.org` - maerlant.wikimedia.org (**PASS**)...
[03:27:56] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (BBlack)
[03:28:18] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (BBlack) a:Papaul
[03:28:39] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (BBlack)
[03:37:36] bblack: do you need some help?
[03:41:33] only if you've recently become a licensed therapist :)
[03:43:06] I'd fix myself first then
[03:46:01] the old esams recdns and lvs are decommed now
[03:46:10] the cp pooling/depooling stuff is complete
[03:46:19] awesome :)
[03:46:19] need to launch the decom/shutdown stuff for the old cps
[03:47:56] so cp3030 - cp3049 are ready to be decomm'ed?
[03:47:59] the main thing I'm annoyed at, is I put the wrong partman recipe in for lvs300[567], but I've noted it to fix later (well soon, but later)
[03:48:16] they have no raid because of it, but are otherwise running fine, we can reimagine them one at a time later
[03:48:21] *reimage
[03:48:40] hmm ok maybe on Monday or you wanna do it before the weekend?
[03:48:53] I can take care of that now if needed
[03:56:05] and yeah cp3030-49 are ready for decom (well, some in that range are already missing, it's kind of a mess)
[03:56:21] "ready for decom" meaning they're depooled out of service
[03:56:34] I still need to make the ticket and run the cookbook and clean up puppet, etc
[03:56:53] (doing that now)
[04:02:38] Traffic, DC-Ops, Operations, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (BBlack)
[04:04:23] Traffic, DC-Ops, Operations, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (BBlack)
[04:05:33] Traffic, DC-Ops, Operations, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (BBlack)
[04:34:47] vgutierrez: maybe I'm losing it.... cp3030 is gone from confctl live data. if I look at lvs3005 (active text LVS in esams), ipvsadm -Ln shows only the new nodes for port 443
[04:35:03] yet I'm on cp3030 (where puppet needs to stay disabled), double-checking
[04:35:19] and I see tons of connections to the ATS port 443 from random public IPs in the ESTABLISHED state in netstat
[04:35:34] and this one has been depooled a long time from the LVS point of view (hours ago)
[04:36:01] hmm checking
[04:38:28] bblack: so atslog-tls isn't showing new requests hitting ats-tls
[04:38:52] besides the prometheus-exporter
[04:38:56] yeah tcpdump didn't show it either
[04:39:06] netstat just has tons of stale ESTABLISHED?
[04:39:06] oh and icinga of course :)
[04:39:24] cp3032 with nginx doesn't show the same issue
[04:39:27] it's real public IPs
[04:39:33] hmmm
[04:39:47] try e.g.
[04:39:48] some timeout isn't behaving as expected apparently
[04:39:49] root@cp3030:~# netstat -anp|grep ESTAB|less
[04:39:57] yeah yeah, I'm seeing it
[04:40:21] I don't think that can even be an applayer issue. the TCP stack in the kernel should've killed those heh
[04:40:53] anyways, can't be real users, continue with decom, right?
[04:41:00] I'm wondering if those are falling under this timeout, proxy.config.net.default_inactivity_timeout: 86400
[04:41:10] oh maybe heh
[04:41:18] let me reduce that to 30 secs on cp3030
[04:41:20] and see what happens
[04:41:31] it should be a reloadable config parameter
[04:41:37] so no restart needed
[04:42:06] ok
[04:45:54] it doesn't look like it makes a huge difference
[04:46:48] ok
[04:46:56] either way I can't see real traffic with tcpdump, assuming ok
[04:47:02] yup
[04:47:19] I'm wondering if proxy.config.ssl.handshake_timeout_in: 0 could be responsible of that
[04:48:23] sigh.. yet another thing to check between ats-tls and nginx \o/
[04:50:40] yah.. something is kinda off regarding ats-tls
[04:51:02] cp5007 shows 147k sockets on :443 established and cp5008 just 50k
[04:51:42] yet another can of worms :)
[04:51:47] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `cp[3030,3032-3035].esams.wmnet` - cp3030.esams.wmnet (**PASS**) -...
[04:53:31] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `cp[3036,3038-3041].esams.wmnet` - cp3036.esams.wmnet (**PASS**) -...
[04:54:04] bblack: vgutierrez: so i confirmed ganeti3003 is at "First Puppet run completed" after install_server was switched, looks like it works. though of course we want to know why mgmt is down, will ask papaul tomorrow but for now i gotta go
[04:54:13] cool
[04:55:05] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `cp[3042-3046].esams.wmnet` - cp3042.esams.wmnet (**PASS**) - Down...
[04:55:49] "Unable to run wmf-auto-reimage-host: Failed to reboot_host" at the very end..
[04:56:13] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `cp[3032,3047,3049].esams.wmnet` - cp3032.esams.wmnet (**FAIL**) -...
[05:00:02] Traffic, Operations: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (Vgutierrez)
[05:00:29] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (BBlack) a:BBlack→Papaul
[05:03:09] Traffic, Operations, decommission, ops-esams: Decommission esams cache_misc hosts - https://phabricator.wikimedia.org/T208585 (BBlack)
[05:09:59] ok so I think cp decom is in a good place for dcops
[05:13:00] cool
[05:13:37] aside from whatever's going on with bast3002 (not sure if daniel has more to block on there before papaul can pull it)
[05:14:14] for all the other stuff, I think we're down to just "multatuli" (ns2) as the remaining in-service legacy host we care about.
[05:14:43] and there's no way I'm fixing that situation with software in time, so pp will have to just move the box physically over to one of the new racks and cable it in so it can keep chugging for now
[05:15:04] if he does that first, then he can attack everything in the other racks without much risk
[05:15:43] Traffic, Operations: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (Vgutierrez) I'm tracking used TCP sockets on eqsin text nodes in https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?orgId=1&from=now-1h&to=now, I've manuall...
[05:15:46] and there's the lvs disks thing
[05:16:42] I can reimage them if you think it's necessary to do it before the weekend
[05:18:12] can you just hit lvs5007?
[05:18:24] the secondary? of course
[05:18:26] then we know at least one has redundant disks, and it's the backup lvs so it's easy
[05:18:32] sure
[05:18:58] err
[05:19:02] 3007, not 5007 :0
[05:19:05] ahahah
[05:19:12] yeah, I understood you :)
[05:19:16] ./dev/sda1 314G 1.5G 297G 1% /
[05:19:26] ^ that should be an md device if the partman thing works right
[05:19:38] I believe I already fixed it in puppet, so should just have to reimage
[05:19:51] I'll check that
[05:20:07] BTW, thanks for noticing the sockets thingie on cp3030, it looks like I've solved it manually on cp5007: https://grafana.wikimedia.org/d/ivPJtZAWz/t236458?orgId=1&from=now-1h&to=now
[05:20:16] I'll track that dashboard and puppetize the setting
[05:20:22] heh nice :)
[05:20:48] <- sleep
[05:20:52] yeah, go rest!
[05:31:07] Traffic, Operations: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (Vgutierrez) p:Triage→Normal
[05:37:31] Traffic, Operations, ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on cumin1001.eqiad.wmnet for hosts: ` ['lvs3007.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/2019102...
[06:07:04] Traffic, Operations, ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['lvs3007.esams.wmnet'] ` and were **ALL** successful.
[06:09:18] lvs3007 is now happy and running with a nice RAID-1: /dev/md0 46G 1.4G 42G 4% / -- md0 : active raid1 sda1[0] sdb1[1]
[08:15:54] mutante: if you check the logs you can see that it failed to ssh to reboot the host because of not matching ECDSA key and strict checking
[08:19:26] Traffic, Operations: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[08:33:25] hi, I'm looking into prometheus and bast3004 FYI
[08:42:31] ack
[08:42:42] specifically to done one last round of rsync and verify then bast3002 is good for decom from my POV
[08:42:48] s/done/do/
[08:44:55] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (ema)
[09:00:35] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4030.ulsfo.wmnet'] ` The log can be found in `/var/log/wm...
[09:12:08] Traffic, Operations, ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (fgiunchedi) Prometheus data sync'd again from bast3002 and copied in place, DNS flipped, Prometheus is live on this host now and not active on bast3002 anymore
[09:19:47] Traffic, Operations: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[09:35:30] Traffic, Commons, MediaWiki-File-management, Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (Aklapper) >>! In T236240#5601713, @Gilles wrote: > The use of -q means t...
[09:38:43] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4030.ulsfo.wmnet'] ` and were **ALL** successful.
[09:47:05] Traffic, Operations: ats-tls shows a huge amount of ESTABLISHED sockets even when the server is depooled - https://phabricator.wikimedia.org/T236458 (Vgutierrez) Open→Resolved
[09:47:08] Traffic, Operations: Move cache text cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231627 (Vgutierrez)
[09:50:53] Traffic, Commons, MediaWiki-File-management, Multimedia, and 2 others: Ghostscript outputs errors to stdout despite -q, preventing Thumbor from generating some thumbnails properly - https://phabricator.wikimedia.org/T236240 (Gilles) Indeed, nice find! Adding `-sstdout=%stderr` fixes the issue.
[09:59:02] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4031.ulsfo.wmnet'] ` The log can be found in `/var/log/wm...
[10:17:51] is the bast3002 decom urgent as in needs to happen today? I can pick that up and continue if needed
[10:22:46] Traffic, Operations: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (ema) For the record, the User-Agent causing this is `FortiGate (FortiOS 5.0)`.
[10:23:24] lol
[10:24:32] godog: I think Daniel is on it
[10:26:50] would be great if we can wipe it today indeed
[10:27:45] godog: or maybe go ahead if Arzhel's mgmt maintenance is over?
[10:28:43] speaking of Fortinet: https://twitter.com/GossiTheDog/status/1187385252559380482
[10:28:50] especially "Fortigate say it’s a feature"
[10:30:04] :O
[10:30:07] no way
[10:31:47] sure I'll take a stab at it
[10:35:18] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4031.ulsfo.wmnet'] ` and were **ALL** successful.
[10:36:12] hi traffic, i have a proposal to make the volatile uri endpoint available on both puppet frontends https://gerrit.wikimedia.org/r/c/operations/puppet/+/542922. could someone here comment to make sure the change to the GeoData location won't cause issues
[10:37:46] XioNoX: is bast3004 otherwise good to go mgmt network wise ?
[10:44:43] IOW can I announce it to ops@ as the bastion to be used now ?
[11:35:57] godog: globally yes, I still have to update all the other mr1 so bast3004 can access the other mgmt networks, but shouldn't be a blocker
[12:33:32] ema: the ulsfo's you're reimaging - it's also undoing the ack of their expiring globalsign cert, which will eventually alert (it's warning now)
[12:46:14] bblack: ah, will ack those. Thanks!
[13:03:56] XioNoX: ok! thanks, I'll send out the announcement email now
[13:07:46] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp4032.ulsfo.wmnet'] ` The log can be found in `/var/log/wm...
[13:11:55] Traffic, Operations: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (BBlack) p:Triage→Normal
[13:36:09] Traffic, Operations, Patch-For-Review: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['ganeti3003.esams.wmnet'] ` The log can be found in `/var/log/wmf-aut...
[13:37:45] Traffic, Operations, observability: Add ats-tls status and availability graphs to frontend-traffic - https://phabricator.wikimedia.org/T236482 (ema)
[13:37:52] Traffic, Operations, observability: Add ats-tls status and availability graphs to frontend-traffic - https://phabricator.wikimedia.org/T236482 (ema) p:Triage→Normal
[13:43:53] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4032.ulsfo.wmnet'] ` and were **ALL** successful.
[14:00:53] Traffic, Operations, Patch-For-Review: Temporarily use ganeti3003 as ns2 authdns - https://phabricator.wikimedia.org/T236479 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['ganeti3003.esams.wmnet'] ` Of which those **FAILED**: ` ['ganeti3003.esams.wmnet'] `
[14:05:57] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ema)
[14:55:15] Traffic, DC-Ops, Operations, decommission: decommission multatuli - https://phabricator.wikimedia.org/T236489 (BBlack)
[15:01:24] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by bblack@cumin1001 for hosts: `multatuli.wikimedia.org` - multatuli.wikimedia.org (**PASS**) - Dow...
[15:03:07] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (BBlack)
[15:03:38] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (BBlack)
[15:03:53] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (BBlack) a:Papaul
[15:08:34] netops, Operations, ops-esams: Setup esams atlas anchor - https://phabricator.wikimedia.org/T174637 (faidon)
[15:15:02] Traffic, Operations, ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (Dzahn)
[15:20:18] Traffic, Operations, ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (Dzahn)
[15:21:28] Traffic, Operations, ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (Dzahn) Open→Resolved Thank you, @fgiunchedi! It's set to active in Netbox and i tested an install to confirm DHCP/tftpboot is working after that was switched too. Resolving.
[15:40:06] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (BBlack) p:Triage→Normal
[15:48:53] Traffic, Operations: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (colewhite) [[ https://logstash.wikimedia.org/app/kibana#/visualize/create?type=histogram&indexPattern=logstash-*&_g=h@05ebc47&_a=h@29cbfe1 | And it's back! ]]
[16:09:52] Traffic, Operations, Wikidata, Wikidata-Query-Service: ulsfo returns 504 error (upstream request timeout) for WDQS requests - https://phabricator.wikimedia.org/T236500 (Bugreporter)
[16:17:49] Traffic, Operations, Wikidata, Wikidata-Query-Service, ops-ulsfo: ulsfo returns 504 error (upstream request timeout) for WDQS requests - https://phabricator.wikimedia.org/T236500 (Bugreporter) Happens in ulsfo only
[21:31:38] Traffic, FR-Q2-FY2019-20-cleanup-list, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations: Geoip lookup - Misidentifying country due to travelling - https://phabricator.wikimedia.org/T175691 (Volans) I can confirm this as it happened to me today. I'm seeing the fund raising b...
[22:42:32] Traffic, Operations, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[23:09:36] i am having trouble with a connection from cp servers to envoy on a misc server. trafficserver log shows "could not connect" but when i use curl to open the exact same URL on the same IP and from the same source.. it works
[23:10:20] for example cp4027 to 10.64.0.54 on 443
[23:11:25] what else to check if it works with curl -S (before it was the cert but now that's fixed)
[23:13:34] remap.config of trafficserver looks fine and same like for other services where i did get this problem
[23:13:38] did not