[07:46:52] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#3689628 (ema)
[07:46:55] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Upgrade cache_misc to Varnish 5 - https://phabricator.wikimedia.org/T177233#3689625 (ema) Open>Resolved a:ema Done.
[10:11:15] Traffic, Operations, Wikimedia-Logstash, Patch-For-Review, Services (watching): RESTBase logs disappeared from logstash - https://phabricator.wikimedia.org/T178078#3689875 (Pchelolo) The logs are back where they belong, so I guess the ticket can be resolved. Thank you @fgiunchedi
[13:20:31] bblack: thoughts on https://gerrit.wikimedia.org/r/#/c/384520/? I've replaced Systemd::Service with Base::Service_unit to take advantage of the systemd_override option
[13:23:16] ema: +1, but be careful that it deploys sanely (as in, that the transition between the different ways of declaring the service doesn't cause a hard service restart everywhere as it deploys)
[13:24:08] ok, will do
[13:25:42] the cp1008 testing noise on your sec stuff has been interesting for vhtcpd testing too :)
[13:26:17] ha!
[13:26:19] I still haven't found any real bugs, but I realized a few more pragmatic improvements after watching the queue sizes grow like crazy during the varnishd downtimes
[13:30:46] bblack: there's a puppetfail for cp1063, a weird one
[13:30:51] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Attempt to assign to a reserved variable name: 'trusted' on node cp1063.eqiad.wmnet
[13:31:39] looking
[13:31:54] I checked 1x upload + 1x text after my merge, though
[13:32:42] apparently random server-side fluke
[13:33:43] oh, fun
[13:34:27] I was in fact surprised that only one host would fail
[13:35:15] yes, usually anytime I merge a patch, a minimum of ~40 hosts fail :)
[13:37:14] haha
[13:37:46] that error message really does raise a few WTF-worthy questions
[13:38:40] clocks apparently
[13:38:51] huh?
[13:38:54] https://tickets.puppetlabs.com/browse/PDB-949
[13:39:15] manifests/realm.pp: $pieces = split($trusted['certname'], '[.]')
[13:39:21] that seems to be our only reference to $trusted
[13:39:55] I have seen that error before for an entirely unrelated change as well
[13:53:57] bblack: tested in labs, patch not merged yet: indeed the different way of defining the service does restart varnish. Thanks for pointing out that that might happen :)
[14:03:40] ha!
[14:03:55] `refresh => false` is the right thing
[15:00:59] disabling puppet on all cache hosts to err on the side of caution and merging
[15:02:03] ema: slightly related, I'm "waiting" for patches to cumin aliases files to simplify your day-to-day operations ;)
[15:02:48] volans: oh yeah, I'll look into that :)
[15:03:22] happy to help/review ofc
[15:06:32] no issues on cp3008, the puppet run went fine and the daemons did not get restarted
[15:06:58] I'm restarting them now for the systemd override changes to take effect
[15:14:47] testing on cp3040 now (text)
[15:18:57] cp3040 also looks good
[15:22:14] cp1072 now (upload)
[15:22:55] testing varnish-{backend,frontend}-restart to avoid future surprises, vcl reloads too
[15:25:23] yeah all good, enabling puppet again
[15:26:45] tomorrow I'll step through the restarts
[15:31:15] yeah good to leave it a day just in case
[15:31:50] varnish is too big to properly audit, but there's always the odd chance of "oh this situation only arises like once a day, where we make this weird syscall that's no longer allowed"
[18:40:12] Traffic, Operations, ops-ulsfo: rack/setup/install cp40(29|3[123]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3691718 (RobH)
[18:40:34] Traffic, Operations, ops-ulsfo: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3691739 (RobH)
[18:44:21] Traffic, Operations, hardware-requests, ops-ulsfo, Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3691743 (RobH) All of these hosts have now been wiped and bios/drac reset. Stalling this until we're ready to sell off the entire batch of old ulsfo...
[18:44:33] Traffic, Operations, hardware-requests, ops-ulsfo, Patch-For-Review: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3691744 (RobH)
[19:26:45] Traffic, Operations, ops-ulsfo: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3691925 (RobH)
[20:28:47] Traffic, Operations, ops-ulsfo: rack/setup/install cp40(29|3[012]).ulsfo.wmnet - https://phabricator.wikimedia.org/T178423#3692052 (RobH) I still need to flash firmware on drac/bios before OS install on all of these.
[20:58:09] Traffic, Operations, ops-ulsfo: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3692125 (RobH)
[21:15:36] Traffic, Operations, ops-ulsfo: rack/setup/install lvs400[567].ulsfo.wmnet - https://phabricator.wikimedia.org/T178436#3692212 (RobH)
[22:17:33] https://observatory.mozilla.org/analyze.html?host=en.wikipedia.org
[22:37:56] Traffic, Operations, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3692384 (Dzahn)
[22:38:15] Traffic, Operations, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#1293753 (Dzahn)
[22:39:48] Traffic, Operations, Wikidata, wikiba.se, and 2 others: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#1293753 (Dzahn) You might hate me for this question but have you considered using an org domain instead of the "se"? Also, i wonder if we...
[22:45:45] HTTPS, Traffic, Operations, Phabricator, procurement: wmfusercontent.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178443#3692404 (Dzahn)
[22:50:07] Traffic, Operations, Wikimedia-Planet, procurement: *.planet.wikimedia.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178444#3692421 (Dzahn)
[22:55:26] Traffic, Operations, Wikimedia-Planet, procurement: *.planet.wikimedia.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178444#3692442 (Dzahn) Actually it looks like everything is a single cert now, so that makes this a duplicate of T178443 and it really just means that globa...
[22:55:57] HTTPS, Traffic, Operations, Phabricator, procurement: wmfusercontent.org SSL cert expires 2017-11-22 - https://phabricator.wikimedia.org/T178443#3692404 (Dzahn) Actually it looks like everything is a single cert now, so that makes this a duplicate of T178444 and it really just means that glob...
[23:04:47] Traffic, Operations, ops-ulsfo: cp4026 memory error - https://phabricator.wikimedia.org/T178011#3692490 (RobH) Open>Resolved memory replacement complete and system returned to service