[04:01:39] Traffic, Discovery-Search, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) p:Low→High
[04:01:45] Traffic, Discovery-Search, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) lists.wm.o is serving both the old and new cert, just like the blog post mentioned: {P17556}
[04:06:04] Traffic, Discovery-Search, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) After restarting apache, and ~50 requests later, I'm only getting the new certificate. Marking as high because the monitoring is pic...
[04:15:58] Traffic, Discovery-Search, SRE, observability: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (Legoktm) Looking in syslog doesn't show much interesting AFAICT, except it seems like puppet/acme-chief is reloading apache every 2-3 days (n...
[07:29:17] Traffic, Discovery-Search, SRE, observability, Patch-For-Review: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (elukey) Maybe it is totally off, but I saw that the cloudelastic nodes use the `tlsproxy::localssl` define (via `elasti...
[08:25:18] reverting the experiment on cp3062 (T293879)
[08:25:19] T293879: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879
[08:27:56] (VarnishTrafficDrop) firing: 69% GET drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=esams - https://alerts.wikimedia.org
[08:35:27] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (cmooney) @papaul @dzahn I had a go at enumerating the iDrac firmware version on o...
[08:37:56] (VarnishTrafficDrop) resolved: 67% GET drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=esams - https://alerts.wikimedia.org
[08:38:56] (VarnishTrafficDrop) firing: 66% GET drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=esams - https://alerts.wikimedia.org
[08:43:11] (VarnishTrafficDrop) resolved: 68% GET drop in text@esams during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=esams - https://alerts.wikimedia.org
[09:17:53] Traffic, Observability-Logging, SRE, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) By giving a very large amount - 3G instead of the default 80M - of `vsl_space` to cp3062, the issue happens less often but still...
[10:34:56] (VarnishTrafficDrop) firing: 68% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=codfw - https://alerts.wikimedia.org
[10:39:56] (VarnishTrafficDrop) resolved: 67% GET drop in text@codfw during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=codfw - https://alerts.wikimedia.org
[10:40:27] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (ayounsi) a:ayounsi→aborrero Thanks for the doc, some follow up questions to make sure I understand it properly. > However, li...
[12:37:55] Traffic, Observability-Logging, SRE, User-ema: varnishmtail metric loss due to mtail not reading from pipe fast enough - https://phabricator.wikimedia.org/T293879 (ema) I've tried using a separate mtail instance with a subset of the scripts used by the production instance, namely: - varnisherror...
[16:42:44] netops, DC-Ops, Infrastructure-Foundations, SRE, and 2 others: Q1:(Need By: TBD) rack/setup/install cloudswift100[12] - https://phabricator.wikimedia.org/T289882 (aborrero) I take it the main concern here is allocating a public IPv4 address, which is a scarce resource, no? It seems we have a rese...
[18:27:46] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Papaul)
[18:31:29] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) >>! In T283582#7447115, @cmooney wrote: > There are many more in eqiad, bu...
[18:32:48] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) @cmooney many thanks for the txt :)
[18:36:06] netops, Continuous-Integration-Infrastructure, DC-Ops, ops-codfw: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) - https://phabricator.wikimedia.org/T283582 (Dzahn) So the ones alerting in eqiad are one case of 2.30.30.30 and one case of "...
[20:10:57] (VarnishTrafficDrop) firing: 62% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://grafana.wikimedia.org/d/000000541/varnish-caching-last-week-comparison?viewPanel=5&var-cluster=text&var-site=eqsin - https://alerts.wikimedia.org
[20:15:57] (VarnishTrafficDrop) firing: (2) 68% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://alerts.wikimedia.org
[20:25:57] (VarnishTrafficDrop) resolved: (2) 68% GET drop in text@eqsin during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/VarnishTrafficDrop - https://alerts.wikimedia.org