[01:53:54] mutante: re: cp4027 to 10.64.0.54 on 443
[01:54:07] mutante: what's the hostname that it's trying to use?
[01:54:25] I'm assuming our TLS certs on the applayer don't provide the SAN with the IP
[01:54:33] so that could explain the connectivity issues
[01:56:03] the SANs on that IP (at least without SNI) are DNS:rt.discovery.wmnet, DNS:rt.svc.eqiad.wmnet, DNS:rt.svc.codfw.wmnet, DNS:moscovium.eqiad.wmnet
[01:56:37] those don't look like any hostname that would be used from the Internet
[01:56:53] so I'd say that the certificate is missing the public-facing hostname used for that service
[02:05:36] mutante: from the remap config file... "map http://rt.wikimedia.org https://rt.discovery.wmnet", I'd say rt.wikimedia.org should be on the cert
[07:20:58] Traffic, Operations, Wikidata, Wikidata-Query-Service, and 2 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (jijiki)
[07:46:08] Traffic, Operations, ops-esams: Degraded RAID on cp3048 - https://phabricator.wikimedia.org/T198784 (Volans) Open→Declined Closing as the host has been decommissioned as part of T236454
[08:23:07] Traffic, Operations: Elevated 502s observed in ulsfo - https://phabricator.wikimedia.org/T236130 (ema) Open→Resolved a:ema We're now returning 403 to those requests. Availability in text@ulsfo looks much better.
[08:38:35] Traffic, Operations, CPT Initiatives (Core REST API in PHP), Core Platform Team Workboards (Green), Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (ema) >>! In T235779#5604796, @WDoranWMF wrote: > @BBlack @ema would you have anytime to revie...
[09:03:12] Traffic, Operations, Wikidata, Wikidata-Query-Service, and 3 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (ema) It is envoy here that times out after 15 seconds (CC @Joe).
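
Referring back to the certificate exchange between 01:54 and 02:05 above, the reasoning can be sketched as a small hostname-versus-SAN check. This is only an illustration, assuming simple DNS SAN entries with at most a leftmost-label wildcard; the helper names `san_matches` and `covered` are hypothetical, and this is not how any of the proxies involved actually validate certificates. The SAN list is the one quoted at 01:56:03.

```python
def san_matches(hostname: str, san: str) -> bool:
    """Return True if one DNS SAN entry covers hostname.

    Simplified matching: case-insensitive, exact label-by-label
    comparison, with "*" allowed only as the whole leftmost label.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if len(host_labels) != len(san_labels):
        return False
    if san_labels[0] == "*":
        return host_labels[1:] == san_labels[1:]
    return host_labels == san_labels


def covered(hostname: str, sans: list[str]) -> bool:
    """Return True if any SAN entry in the list covers hostname."""
    return any(san_matches(hostname, san) for san in sans)


# DNS SANs reported in the log for the certificate served on 10.64.0.54:443
SANS = [
    "rt.discovery.wmnet",
    "rt.svc.eqiad.wmnet",
    "rt.svc.codfw.wmnet",
    "moscovium.eqiad.wmnet",
]

if __name__ == "__main__":
    for name in ("rt.discovery.wmnet", "rt.wikimedia.org", "10.64.0.54"):
        print(name, covered(name, SANS))
```

Under these assumptions only `rt.discovery.wmnet` is covered; the public name `rt.wikimedia.org` and the bare IP are not, which is consistent with the suggestion that a client verifying against either of those would reject the certificate.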
[09:27:11] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (Papaul)
[09:27:28] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission multatuli - https://phabricator.wikimedia.org/T236489 (Papaul) Open→Resolved complete
[09:32:26] Traffic, Operations, ops-esams: rack/setup/install bast3004 - https://phabricator.wikimedia.org/T236394 (Papaul)
[09:37:01] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (Papaul)
[09:37:39] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission nescio and maerlant - https://phabricator.wikimedia.org/T236452 (Papaul) Open→Resolved complete
[09:45:56] Traffic, Operations, ops-esams: rack/setup/install lvs300[567] - https://phabricator.wikimedia.org/T236294 (Papaul) @Vgutierrez @BBlack if there is nothing else to do on these servers as far as racking and setting up, can we resolve this task?
[09:46:26] Traffic, DNS, Operations, ops-esams: rack/setup/install dns300[12] - https://phabricator.wikimedia.org/T236217 (Papaul) @Vgutierrez @BBlack if there is nothing else to do on these servers as far as racking and setting up, can we resolve this task?
[09:58:47] Traffic, DC-Ops, Operations, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (Papaul)
[10:12:43] Traffic, DC-Ops, Operations, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (Papaul)
[10:12:59] Traffic, DC-Ops, Operations, decommission: decommission cp3030-3049 - https://phabricator.wikimedia.org/T236454 (Papaul) Open→Resolved complete
[10:18:31] <_joe_> bblack, ema now the preceding functionality of being able to set default weights for new nodes is restored, by using 'sudo initialize' when you finish setting up a server
[10:19:04] <_joe_> it also allows you to change the defaults between clusters, which you couldn't before
[10:51:38] Traffic, Operations, Wikidata, Wikidata-Query-Service, and 3 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (ema) @Bugreporter timeout raised to 65 seconds, this should fix the 504 errors.
[11:03:19] _joe_: cool, thanks!
[11:41:22] Traffic, Analytics, Analytics-Kanban, Operations, observability: Publish tls related info to webrequest via varnish - https://phabricator.wikimedia.org/T233661 (JAllemandou) From looking at the dashboards, it looks like the entire set of values we want to collect is what is currently display...
[12:13:31] Traffic, Operations, CPT Initiatives (Core REST API in PHP), Core Platform Team Workboards (Green), Patch-For-Review: Implement basic routing for rest.php - https://phabricator.wikimedia.org/T235779 (WDoranWMF) Awesome, thanks @ema !
[12:37:16] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5007.eqsin.wmnet'] ` The log can be found in `/var/log/wm...
[13:35:35] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5007.eqsin.wmnet'] ` and were **ALL** successful.
[13:49:55] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (Papaul)
[13:50:22] Traffic, DC-Ops, Operations, decommission, Patch-For-Review: decommission lvs300[1234] - https://phabricator.wikimedia.org/T236451 (Papaul) Open→Resolved complete
[15:08:34] Traffic, Operations, observability: global HTTP (un)availability number, as reported in Frontend Traffic dashboard, is bogus - https://phabricator.wikimedia.org/T234567 (CDanis) Open→Resolved a:CDanis
[15:26:59] Wikimedia-Apache-configuration, Operations, serviceops: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (RLazarus)
[15:36:40] Traffic, Operations, observability: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge - https://phabricator.wikimedia.org/T236700 (CDanis)
[15:40:01] Traffic, Operations, observability: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge - https://phabricator.wikimedia.org/T236700 (CDanis) p:Triage→Normal
[15:55:55] Traffic, Operations, Wikidata, Wikidata-Query-Service, and 2 others: large number of 504 errors from ulsfo - https://phabricator.wikimedia.org/T236500 (sbassett) >>! In T236500#5609046, @Bugreporter wrote: > @jijiki The Custom Policy does not make sense since #Traffic is currently a public-joinab...
[16:38:14] vgutierrez: thanks, you were right of course. cert issue
[16:44:51] netops, Operations, ops-esams: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (Dzahn)
[16:50:49] netops, Operations, ops-esams: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (BBlack) a:BBlack I'll poke at this today since Arzhel's not here (may take a couple hours, squeezing it around meetings)
[17:14:03] netops, Operations, ops-esams: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (BBlack) Open→Resolved Turns out it was simpler than I thought! Should be done here, re-open if it's still not working.
[18:24:11] netops, Operations, ops-esams: bast3004 can't reach mgmt networks - https://phabricator.wikimedia.org/T236686 (Dzahn) confirmed working now. Thanks! ` [bast3004:~] $ ping -c1 -w1 cp5007.mgmt.eqsin.wmnet PING cp5007.mgmt.eqsin.wmnet (10.132.129.107) 56(84) bytes of data. 64 bytes from cp5007.mgmt.eq...
[21:21:03] Traffic, Operations: track NIC firmware version numbers across the fleet - https://phabricator.wikimedia.org/T236744 (CDanis)
[21:30:20] bblack: let me know if you have any thoughts re: https://phabricator.wikimedia.org/T236744 -- I think whatever we choose will be simple to do
[22:23:27] cdanis: seems like facter would be more "appropriate" for this kind of data than abusing prometheus stats. My only qualm is it adds to facter run-time bloat, but meh who cares really?
[22:38:51] having it in prometheus allows for some nice graphs like https://grafana.wikimedia.org/d/000000556/microcode-updates, though
[22:40:04] facter is probably easier to work with, though, as it allows for some Cumin queries on what to update
[22:40:59] long term we could also treat the various firmware states as data points for the host tracking in debmonitor
[23:06:59] I think I might just implement both, neither is very hard, and I might wind up implementing a textfile-collector-driven-by-systemd-timer Puppet module as a learning exercise
[23:07:43] bblack: 'abusing Prometheus status' isn't quite right btw -- the post that technique comes from is the blog of the project lead, and it's also the way that Prometheus exports its own version metadata as a metric :)
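
The technique discussed from 22:23 onward, exporting version metadata as a metric with a constant value of 1 and the interesting data in labels, could look roughly like the sketch below for a node_exporter textfile collector. Everything specific here is an assumption for illustration: the metric name `node_nic_firmware_info`, the label names, the sample values, and the output path are hypothetical and are not taken from T236744. A systemd timer, as mentioned at 23:06:59, would simply run such a script periodically.

```python
import os
import tempfile


def escape_label_value(value: str) -> str:
    """Escape backslash, newline and double quote for the Prometheus text format."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_info_metric(name: str, help_text: str, samples: list[dict]) -> str:
    """Render one 'info'-style gauge: value is always 1, data lives in labels."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels in samples:
        pairs = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{pairs}}} 1")
    return "\n".join(lines) + "\n"


def write_atomically(path: str, content: str) -> None:
    """Write to a temp file in the same directory, then rename over the target,
    so the collector never reads a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must be able to read it
    os.replace(tmp_path, path)


if __name__ == "__main__":
    # Hypothetical data; a real script would gather this per host,
    # for example by parsing `ethtool -i <device>` output.
    samples = [{"device": "eth0", "driver": "bnx2x", "firmware_version": "7.13.1"}]
    text = render_info_metric(
        "node_nic_firmware_info", "NIC firmware version (constant 1).", samples
    )
    # Hypothetical path; it must be a *.prom file inside the directory
    # passed to node_exporter via --collector.textfile.directory.
    write_atomically("nic_firmware.prom", text)
    print(text, end="")
```

Keeping the value fixed at 1 means the series can be joined against other metrics or counted per `firmware_version` in a dashboard, which is what makes graphs like the microcode one linked above possible.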