[03:48:22] Traffic, Operations, ops-eqiad: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289 (Vgutierrez)
[04:07:28] Traffic, Operations, ops-eqiad: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289 (Vgutierrez) Nothing on the logs as well, this looks awfully familiar to T237348 and T238032
[09:36:19] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3058.esams.wmnet'] ` The log can be found in `/var/log/wm...
[09:36:45] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[09:37:35] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[09:37:38] Traffic, Operations, ops-eqiad: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289 (Vgutierrez)
[09:37:41] Traffic, Operations, ops-esams: cp3065 crashed - https://phabricator.wikimedia.org/T238032 (Vgutierrez)
[09:37:44] Traffic, Operations, ops-esams: cp3057 is unreachable - https://phabricator.wikimedia.org/T237348 (Vgutierrez)
[09:38:34] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez) p:Triage→High
[09:56:30] Traffic, Operations: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (Vgutierrez)
[09:57:29] Traffic, Operations: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (Vgutierrez) from cp5007: ` vgutierrez@cp5007:~$ journalctl -u trafficserver-tls --since="7days ago" |grep "settings bad param" |cut -f1-2 -d' ' |uniq -c 21 Nov 07 55 Nov...
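
The `journalctl | grep | cut | uniq -c` pipeline quoted at [09:57:29] tallies matching log lines per day. A minimal Python sketch of the same idea follows, assuming journal lines start with a `Mon DD` prefix as in the quoted output. The function name `count_by_day` and the sample lines are hypothetical and are not taken from the cp5007 logs.

```python
from collections import Counter

# Hypothetical sample lines; the real trafficserver-tls message format may differ.
SAMPLE_LINES = [
    "Nov 07 10:00:01 host example[1]: recv settings bad param",
    "Nov 07 10:05:42 host example[1]: recv settings bad param",
    "Nov 07 11:00:00 host example[1]: unrelated message",
    "Nov 08 09:12:30 host example[1]: recv settings bad param",
]


def count_by_day(lines, needle="settings bad param"):
    """Count lines containing `needle`, keyed by their first two space-separated fields.

    Mirrors `grep needle | cut -f1-2 -d' ' | uniq -c`. Unlike `uniq -c`, which counts
    adjacent runs, this returns one total per key; for chronologically ordered logs
    within a single year the two give the same numbers.
    """
    counts = Counter()
    for line in lines:
        if needle in line:
            # cut -f1-2 -d' ' keeps the first two fields, e.g. "Nov 07"
            day = " ".join(line.split(" ")[:2])
            counts[day] += 1
    return counts


if __name__ == "__main__":
    for day, n in count_by_day(SAMPLE_LINES).items():
        print(n, day)
```

To run it against a real journal, the lines could be piped in and read from `sys.stdin` instead of `SAMPLE_LINES`.
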
[09:58:31] Traffic, Operations: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (Vgutierrez) It looks like our ATS build is missing https://github.com/apache/trafficserver/pull/5636
[10:02:06] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3058.esams.wmnet'] ` The log can be found in `/var/log/wm...
[10:18:28] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (MoritzMuehlenhoff) I tried to narrow this down a bit, but no real luck: - These haven't seen a microcode update yet (and the previous microcode update round dates back quite a while). - All of th...
[10:19:49] Traffic, Operations: debmonitor TLS termination - https://phabricator.wikimedia.org/T238200 (ema) Open→Resolved a:ema TLS termination configured on port 7443: ` $ curl -v https://debmonitor.wikimedia.org:7443/login/ --resolve debmonitor.wikimedia.org:7443:10.64.32.62 2>&1 | grep '< HTTP' < HT...
[10:28:27] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (MoritzMuehlenhoff) cp1077 might also be a totally different issue than cp3* (which are from the same model/generation/ordering batch; in kern.log on cp1077 there's two oopses from Nov 5, it's no...
[10:36:11] Traffic, Operations: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3058.esams.wmnet'] ` and were **ALL** successful.
[11:28:56] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Peachey88)
[11:32:12] Traffic, Operations, ops-eqiad: cp1077 is unreachable - https://phabricator.wikimedia.org/T238289 (Vgutierrez) Open→Resolved a:Vgutierrez Tracking the issue on the parent task: T238305
[11:32:14] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[11:32:27] Traffic, Operations: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[11:32:29] Traffic, Operations, ops-esams: cp3065 crashed - https://phabricator.wikimedia.org/T238032 (Vgutierrez) Open→Resolved a:Vgutierrez Tracking the issue on the parent task: T238305
[11:40:07] Traffic, Operations, Patch-For-Review: ats-tls shows spikes on H/2 recv settings bad param errors - https://phabricator.wikimedia.org/T238307 (ema) p:Triage→Normal
[12:21:43] gehel: hello!
[12:22:11] just to confirm: do we really also want to get rid of geoshape rate limiting here? https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545723/
[12:47:57] ema: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550826/ ?
[12:48:12] (the problem observation comes from our new turnilo queries for TLS stats)
[12:48:36] I think that patch doesn't have any unintended consequences, but it's VCL so :P
[12:48:47] ema: checking
[12:50:07] ema:
[12:50:35] ema: maybe not, we added that during the incident, so I added it to the rollback. But it might make sense to keep it. Just in case
[12:56:05] ema: CR updated
[13:23:54] godog: maybe you know? how do we delete dashboards in grafana?
[13:24:31] (and then my next question at some point in the coming days might be: if we stop sending some type of voluminous prometheus data that was driving them, is there a point in trying to clean it up or just let it age out?
[13:24:35] )
[13:25:05] logging into grafana and using the delete button gives a 403, I assume for good reasons :)
[13:46:45] bblack: looks reasonable!
[13:50:11] bblack: which dashboard ? afaik the delete button should work unless the dashboard is in puppet
[13:51:24] the tls dashboards
[13:51:39] jbond42: hello! I'm getting some interesting pcc errors for stuff that should Just Work such as https://puppet-compiler.wmflabs.org/compiler1002/19380/
[13:51:41] https://grafana.wikimedia.org/d/000000458/tls-ciphersuite-explorer?orgId=1
[13:51:52] https://grafana.wikimedia.org/d/000000452/tls-ciphers-by-data-center?orgId=1
[13:52:06] jbond42: does this ring any bells?
[13:52:11] looking
[13:52:46] ema that looks like you are missing the secret file in the labs repo
[13:53:34] jbond42: I can't read. Thanks!
[13:53:43] godog: ^
[13:53:54] np
[13:54:26] bblack: bizarre, no that should work, ok if I try deleting them too ?
[13:55:13] sure
[13:56:29] the plot thickens, I hit delete then grafana complained with a red popup, deleting again worked
[13:56:38] and deleting the other one worked on first try
[13:56:56] heh ok
[13:57:05] I saw the red popup with a 403 inside it the first time and just stopped
[13:57:26] I can see the rabbit hole from here though, I don't think it is wise for me to go in now :)
[13:57:49] basically the data's misleading/bad in them anyways, and we have better stuff now via real analytics
[13:58:08] and we're about to start taking actions based on that data, and I don't want someone then pulling up links to the bad version of the data to argue with us about what's happening :)
[13:59:01] hehe totally
[13:59:24] is there a (native?) way to export turnilo results/dashboards ?
[14:04:07] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp3060.esams.wmnet'] ` The log can be found in `/var/log/wm...
[14:39:33] Traffic, Operations: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3060.esams.wmnet'] ` and were **ALL** successful.
[14:47:41] godog: right side top of page, there is an export button that exports to CSV
[14:47:58] it's like the "share" thing on mobile :P
[14:55:34] screenshot :)
[15:08:23] sukhe: hah! TIL, thanks
[15:45:12] Wikimedia-Apache-configuration, Operations, serviceops, Patch-For-Review: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (crusnov) We use `scap` to deploy Netbox, a possible use-case would be to run httpbb as the last step to verify that apache is configured...
[15:47:56] Wikimedia-Apache-configuration, Operations, serviceops, Patch-For-Review: Build a black-box httpd testing framework - https://phabricator.wikimedia.org/T236699 (crusnov) >>! In T236699#5664011, @crusnov wrote: > We use `scap` to deploy Netbox, a possible use-case would be to run httpbb as the las...
[16:47:42] Traffic, Operations: Renew and deploy GlobalSign unified cert (2019) - https://phabricator.wikimedia.org/T237650 (RobH)
[17:52:40] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (RobH) Please note part of the ePSA tool is checking the SEL. So the SEL has to be cleared before running the test. @bblack let me know this server had issues with the storage ssd needing r...
[18:09:22] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (RobH) No errors in quick test, full testing is in progress.
[18:39:06] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (RobH) a:RobH→BBlack ` All tests passed. Validation Code : 84413 ` So all testing has passed. I've gone ahead and powered down the host. Not sure on next steps, will need to sync...
[19:56:24] Traffic, Core Platform Team, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (matmarex) I ran `curl -I "https://ban.wikipedia.org/wiki/Mal:;"` in a loop for a while...
[20:05:27] Traffic, Core Platform Team, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (Ladsgroup) I think this has to do something from differences between ATS and varnish no...
[20:34:49] Traffic, Core Platform Team, Operations, Wikimedia-General-or-Unknown, User-DannyS712: Pages whose title ends with semicolon (;) are intermittently inaccessible - https://phabricator.wikimedia.org/T238285 (BBlack) >>! In T238285#5665013, @Ladsgroup wrote: > I think this has to do something wi...
[21:26:30] Traffic, netops, Operations, Patch-For-Review: Anycast recdns - https://phabricator.wikimedia.org/T186550 (Jgreen)
[21:26:52] Traffic, Operations, SRE-tools, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (BBlack) Seems sane! The only thing I'm a little iffy about is from the "SHA1 written to etcd" onwards. I'm not sure it's a bad approach, bu...
[21:30:28] Traffic, Operations, SRE-tools, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (crusnov) >>! In T233183#5665281, @BBlack wrote: > Seems sane! The only thing I'm a little iffy about is from the "SHA1 written to etcd" onwa...
[21:43:08] Traffic, DC-Ops, Operations, ops-esams: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (BBlack) a:BBlack→RobH I don't think there's anything else we can do either. We can't keep it alive booted into an OS for very long before we get a Linux kernel crash in the network...
[22:16:22] Traffic, Operations: Renew and deploy GlobalSign unified cert (2019) - https://phabricator.wikimedia.org/T237650 (Seb35) Thanks for the detailed explanation, it’s interesting. I have no further details from the user I reported the issue.
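
The T238285 comment at [19:56:24] mentions running `curl -I` against a URL in a loop to catch intermittent failures. A rough Python sketch of that kind of probe is below, assuming the goal is simply to tally HTTP status codes over repeated HEAD requests. The names `head_status` and `tally_statuses` and the User-Agent string are hypothetical, and this is not the command that was actually run on the task.

```python
import urllib.error
import urllib.request
from collections import Counter


def head_status(url, timeout=10.0, user_agent="status-probe-sketch/0.1 (hypothetical)"):
    """Send one HEAD request and return the HTTP status code.

    Note: unlike `curl -I`, urllib follows redirects, so this reports the status
    of the final response rather than of the first hop.
    """
    req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return err.code


def tally_statuses(url, attempts, fetch=head_status):
    """Call `fetch(url)` `attempts` times and count how often each status appears."""
    counts = Counter()
    for _ in range(attempts):
        counts[fetch(url)] += 1
    return counts


if __name__ == "__main__":
    # URL taken from the task comment; the number of attempts is arbitrary.
    print(tally_statuses("https://ban.wikipedia.org/wiki/Mal:;", attempts=20))
```

The `fetch` parameter is injectable so the tallying logic can be exercised without network access; an intermittent problem would show up as more than one status code in the resulting counts.
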