[08:26:19] https://phabricator.wikimedia.org/T209810 <- the task to have swift access logs longer than 3/4 days
[12:44:43] Traffic, Operations, Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (ema) p:Triage>Normal
[13:41:12] Traffic, Multimedia, Operations: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (ema)
[13:46:24] Traffic, Operations: INMARSAT geolocates to the UK, leading to requests going to esams - https://phabricator.wikimedia.org/T209785 (BBlack) When looking at the latest MaxMind data, it locates this network as being in New Zealand, which we map to ulsfo as first choice, and esams as the last-resort choice....
[14:06:45] Traffic, Multimedia, Operations: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (TheDJ) This is due to experiments with {T27611}
[14:10:04] is https://phabricator.wikimedia.org/T99531 on your radar?
[14:12:55] Traffic, Multimedia, Operations: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (TheDJ) Opera 11.60 release in 2011-12-06 (.64 are just security updates). I guess in theory we could blacklist the old Opera UAs in the varnish confi...
[14:20:24] Traffic, Multimedia, Operations, Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (ema) p:Triage>Normal
[14:25:42] Traffic, Multimedia, Operations, Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (TheDJ) Note: The chromium/webkit versions of Opera after opera 15 use the OPR string to identify Opera. These browsers likely D...
[14:53:09] I think I got bit by this https://stackoverflow.com/questions/34938706/varnish-4-does-not-honor-cache-control-must-revalidate. Context is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474684/1/modules/releases/templates/apache.conf.erb. Url is https://releases.wikimedia.org/charts/index.yaml
[14:54:06] I was hoping to avoid setting max-age and s-maxage and rely on the last-modified header
[14:59:30] ah dammit, that only applies to resources that are already stale
[15:03:31] ema: the icinga checks are triggering false positives in ATS nodes icinga?
[15:06:52] vgutierrez: ah!
[15:07:38] yes that's because we check for minimum one and maximum one running traffic_server process, but one of the new checks runs traffic_server
[15:10:55] Traffic, Multimedia, Operations, Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (Gilles) I guess this means that these older Opera versions send request headers stating that they accept webp when they're in f...
[15:13:04] Traffic, Multimedia, Operations, Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (TheDJ) @gilles see my note in T209805#4758174 v11 probably supports some early versions of them, but not all.
[15:14:06] vgutierrez: something like https://gerrit.wikimedia.org/r/474706 ?
[15:15:06] that or check that the ATS port is open by the traffic_server daemon, and then you don't care about traffic_server flags
[15:15:15] but yes :)
[15:18:27] Traffic, Multimedia, Operations, Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (Gilles) Indeed. I've installed 11.64 and even the lossy ones we generate don't work. And it does advertise webp support in requ...
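Note on the T209805 thread: the idea floated there is to stop trusting the Accept header from old Opera UAs, since they advertise WebP but fail to render it. The change that was actually deployed in the varnish config is not shown in this log, so what follows is only a rough Python sketch of the decision. The function name and the UA rule are assumptions: Presto-based Opera starts its UA with an "Opera/" token, while the Chromium-based versions mentioned above carry "OPR/" instead.

```python
import re

# Hypothetical rule, not the VCL that was deployed for T209805.
# Presto-based Opera (12.x and older) sends a UA beginning with "Opera/";
# Chromium-based Opera (15+) identifies itself with an "OPR/" token instead.
LEGACY_OPERA = re.compile(r"^Opera/")


def should_serve_webp(accept: str, user_agent: str) -> bool:
    """Serve WebP only if the client advertises it and is not legacy Opera."""
    advertises_webp = "image/webp" in accept.lower()
    return advertises_webp and not LEGACY_OPERA.search(user_agent)


if __name__ == "__main__":
    # Illustrative UA strings, not taken from the log.
    old_opera = "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.10.229 Version/11.64"
    new_opera = "Mozilla/5.0 (X11; Linux x86_64) Chrome/70.0 Safari/537.36 OPR/56.0"
    print(should_serve_webp("image/png, image/webp", old_opera))
    print(should_serve_webp("image/webp,*/*;q=0.8", new_opera))
```

Matching on the leading token rather than on a version number keeps the rule short, at the cost of also excluding any Presto release that did render WebP correctly.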
[15:22:30] Traffic, Multimedia, Operations, Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (Gilles) I've just verified the current stable Opera out of curiosity and it does (unsurprisingly) render our webps correctly.
[15:22:57] Traffic, Multimedia, Operations, Performance-Team, Patch-For-Review: Wikipedia sends WebP thumbnails even when the browser does not support it - https://phabricator.wikimedia.org/T209805 (Gilles) a:ema
[15:30:31] ema: let me know when it's rolled out everywhere, since I have 11.64 to test with
[15:39:59] ah, seems to work now (hitting esams)
[15:40:51] Traffic, Multimedia, Operations, Performance-Team: Wikipedia sends WebP thumbnails when Opera claims to support it but lies - https://phabricator.wikimedia.org/T209805 (Gilles) Open>Resolved
[15:41:12] Traffic, Multimedia, Operations, Performance-Team: Wikipedia sends WebP thumbnails when Opera claims to support it but lies - https://phabricator.wikimedia.org/T209805 (Gilles) Verified the fix on enwiki front page using Opera 11.64
[15:43:11] gilles: nice!
[15:45:58] (fully deployed now)
[15:49:22] Krenair: do we already have a task for deploying a certcentral managed certificate in librenms as initial test?
[15:49:28] Krenair: I do remember discussing this here
[15:49:36] afternoon vgutierrez
[15:49:41] hi :D
[15:49:56] there is a ticket about domains to start with
[15:50:12] yes, https://phabricator.wikimedia.org/T207050
[15:50:14] I believe I posted a list of in scope domains and it got assigned to you
[15:51:06] Do you want subtasks for each domain (or pair of domains) there vgutierrez ?
[15:51:17] at least I'm creating one for the firsto ne
[15:51:19] *first one
[15:51:42] I'm suspecting that we will find some dragons
[15:52:24] heh
[15:52:34] the actual deployment of certs is as yet untested in prod
[15:52:51] Certcentral, Traffic, Operations: Deploy a certcentral managed TLS certificate for librenms - https://phabricator.wikimedia.org/T209856 (Vgutierrez) p:Triage>Normal
[15:52:54] Certcentral, Traffic, Operations: Deploy a certcentral managed TLS certificate for librenms - https://phabricator.wikimedia.org/T209856 (Vgutierrez)
[15:54:49] Krenair: so.. deploying a certcentral certificate has two steps, 1. configure the new certificate in certcentral, 2. add the certcentral::cert resource in the target system(s)
[15:55:03] I guess those steps should be two separate commits, right?
[15:55:15] eh
[15:55:40] I imagine puppet will error on the target systems until certcentral is ready for it
[15:55:50] don't know how big a problem that is
[15:55:58] yes, that's why I suggest splitting it in two commits
[15:55:59] safest option is to do it separately in order
[15:56:05] as you say
[15:56:13] otherwise we should silence the target host(s) in icinga to avoid alerts
[15:57:03] I don't know how widely icinga distributes alerts for puppet errors, probably just to IRC and the web UI
[15:57:57] Traffic, Maps, Operations, Reading-Infrastructure-Team-Backlog (Kanban): Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732 (Jhernandez) Do we want to keep this one open to wait for the stretch migration and checking on the eventbus load, or should we spin u...
[16:05:30] Krenair: from https://puppet-compiler.wmflabs.org/compiler1002/13584/netmon1002.wikimedia.org/ it looks like we forgot to add the new certificate flavours to the certcentral::cert resource?
[16:05:57] Krenair: fullchain / only chain / cert only should be deployed IIRC
[16:08:16] certcentral makes crt, chain.crt and chained.crt available
[16:08:46] certcentral::cert calls the crt public.pem, and the chained.crt fullchain.pem
[16:08:54] it doesn't bother with chain.crt
[16:09:19] IIRC this was to support stuff that required configuring the leaf (? terminology) cert and chain in separate files
[16:09:26] right
[16:10:19] we could make it pull chain.crt from certcentral too
[16:11:00] dunno if librenms requires it or not
[16:11:41] ooh
[16:11:45] modules/librenms/templates/apache.conf.erb
[16:11:49] SSLCertificateFile /etc/acme/cert/librenms.crt
[16:11:49] SSLCertificateChainFile /etc/acme/cert/librenms.chain.crt
[16:12:01] sigh :)
[16:12:11] I guess that we need to deploy the chain as well
[16:12:19] vgutierrez, let's change the puppet resource to include chain.crt and also use the same names as certcentral?
[16:12:24] yeah that's why you need all these variants heh
[16:12:34] Krenair: ok
[16:12:45] vgutierrez, wanna do that or shall I?
[16:12:58] I can append that as the first commit in my branch
[16:13:01] ok
[16:13:10] tbh it's puppet so you can just self-approve anyway :p
[16:13:18] luckily you don't have to deploy it and configure it in one step anyways
[16:13:42] you can do the CC part and validate that it drops the right files in place, before doing the commit that switches apache config
[16:13:52] yeah
[16:14:07] Krenair: in any case, a +1 from another people with knowledge about the issue is welcomed and desired :)
[16:14:16] s/people/person/
[16:14:23] ok :)
[16:32:57] Krenair: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474730/ looking good now?
[16:34:01] vgutierrez, lgtm
[16:34:09] great!
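Note on the three certificate flavours discussed above (crt, chain.crt, chained.crt): a minimal Python sketch of how they relate, assuming the usual convention that the chained file is the leaf certificate followed by the intermediates. That ordering and the function name `split_chained` are assumptions for illustration; the log does not show how certcentral builds these files.

```python
import re
from typing import Tuple

# Matches one PEM certificate block, including its trailing newline if present.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----\n?", re.DOTALL
)


def split_chained(chained_pem: str) -> Tuple[str, str]:
    """Split a leaf-first bundle into (leaf, chain).

    Assumes the first block is the leaf and the rest are intermediates,
    which would correspond to crt and chain.crt respectively.
    """
    certs = PEM_CERT.findall(chained_pem)
    if not certs:
        raise ValueError("no PEM certificates found")
    return certs[0], "".join(certs[1:])


if __name__ == "__main__":
    # Placeholder bodies; real certificates carry base64 DER here.
    leaf = "-----BEGIN CERTIFICATE-----\nLEAF\n-----END CERTIFICATE-----\n"
    inter = "-----BEGIN CERTIFICATE-----\nINTERMEDIATE\n-----END CERTIFICATE-----\n"
    crt, chain = split_chained(leaf + inter)
    print(crt == leaf, chain == inter)
```

An Apache config that sets both SSLCertificateFile and SSLCertificateChainFile, like the librenms template quoted above, needs the two halves as separate files, which is why the chain-only flavour has to be deployed as well.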
[16:47:49] Traffic, Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[16:47:54] Traffic, Operations, Patch-For-Review: Define and deploy Icinga checks for ATS backends - https://phabricator.wikimedia.org/T204209 (ema) Open>Resolved
[16:48:08] Traffic, Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[16:52:23] well... certcentral1001 got the librenms certificates on the first attempt, no errors
[16:53:33] okay good
[16:53:39] 2001?
[16:54:00] 2001 is a passive one now
[16:54:09] oh right so it doesn't do this anymore
[16:54:11] so in the next puppet run it will be synced over rsync
[16:54:13] indeed
[16:54:19] so next step, have the target server pull the cert
[16:54:28] yes... let's merge that change
[16:55:02] I just remembered
[16:55:32] puppet will run on the target node and the puppetmaster will store the resource that gives it access
[16:56:00] it isn't until puppet runs on the certcentral host that certcentral knows it's authorised
[16:56:25] so puppet may fail for a while until both puppet runs have occurred in order :/
[16:56:55] should eventually sort itself out
[16:59:12] it shouldn't be clientrun -> ccrun -> clientrun though, I think?
[16:59:30] I think once the puppetization is committed, it's just cc running the agent, then the client.
[16:59:47] but I'm not 100% sure
[17:00:02] you're right bblack, client --> cc1001 --> client
[17:00:30] that's pretty much what I said
[17:00:46] oh wow
[17:00:50] Notice: /Stage[main]/Profile::Librenms/Certcentral::Cert[librenms]/File[/etc/centralcerts/librenms.ec-prime256v1.key]/ensure: defined content as '{md5}fc0bb69bf0d6ef33a7e5be692f59a4a1'
[17:00:51] \o/
[17:01:02] yay
[17:01:34] it worked at the first attempt
[17:02:10] as expected, on the first run in netmon1002, puppet got some 403 attempting to fetch the certificates
[17:02:10] on the second one, it worked like a charm
[17:02:16] aaaand I'm late for the meeting
[17:02:16] right
[17:06:54] https://gerrit.wikimedia.org/r/474743
[17:07:30] Traffic, Operations, ops-eqsin: cp5001 unreachable since 2018-07-14 17:49:21 - https://phabricator.wikimedia.org/T199675 (RobH) The old ticket was too old, and new ticket 19131684 has been opened. I'm working this (sending over all the old info and logs) and will schedule another onsite attempt.
[17:18:26] netops, Operations, ops-eqiad: Fix missing PDU's for row C eqiad in netbox - https://phabricator.wikimedia.org/T208091 (ayounsi)
[17:20:39] netops, Operations, Patch-For-Review: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (ayounsi) Reply from the RIPE: > I see that you have found the problem as my graphs are looking normal now. From what I can gather, it was packet loss on IPv6 cau...
[17:23:10] Traffic, Operations: ATS production-ready as a backend cache layer - https://phabricator.wikimedia.org/T207048 (ema)
[17:23:12] Traffic, Operations, Patch-For-Review: ATS: log inspection at runtime - https://phabricator.wikimedia.org/T204225 (ema) Resolved>Open There's a problem with fifo-log-demux reading from the pipe, reopening!
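Note on the puppet run ordering discussed between 16:55 and 17:02 (client run, then certcentral run, then client run again, with a 403 on the first client run): a toy Python model of that sequence, following the description given at 16:55-16:56. The class and method names are hypothetical and this is not certcentral's real API; it only illustrates why the first fetch is refused.

```python
class CertCentralSim:
    """Toy model of the authorisation handshake, not the real service."""

    def __init__(self):
        self.stored = set()      # resources stored on the puppetmaster by client runs
        self.authorised = set()  # hosts certcentral currently knows about

    def client_run(self, host: str) -> int:
        """Simulate a puppet run on a target host; return an HTTP-like status."""
        # The client's run stores the resource that would grant it access...
        self.stored.add(host)
        # ...but the fetch only succeeds once certcentral has picked it up.
        return 200 if host in self.authorised else 403

    def cc_run(self) -> None:
        """Simulate a puppet run on the certcentral host."""
        self.authorised |= self.stored


if __name__ == "__main__":
    sim = CertCentralSim()
    print(sim.client_run("netmon1002"))  # refused: certcentral has not run yet
    sim.cc_run()
    print(sim.client_run("netmon1002"))  # accepted after the certcentral run
```

In this model the puppet failures on the target host clear themselves after one certcentral run and one further client run, in line with "should eventually sort itself out".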
[17:24:44] Certcentral, Traffic, Operations, Patch-For-Review: Deploy a certcentral managed TLS certificate for librenms - https://phabricator.wikimedia.org/T209856 (Vgutierrez) looking good: `vgutierrez@neodymium:~$ sudo cumin netmon1002.wikimedia.org,netmon2001.wikimedia.org 'sha256sum /etc/centralcerts/l...
[17:29:01] netops, Operations: Access to network devices for Riccardo (volans) - https://phabricator.wikimedia.org/T208726 (RobH) removing the project for access requests, since this is now a netops thing.
[17:37:59] Krenair: I'm wondering if before switching librenms to the certcentral issued certificate we should provide an icinga SSL certificate check for it
[17:38:33] netops, Operations: asw2-a-eqiad FPC2 reboot - https://phabricator.wikimedia.org/T209588 (Cmjohnson) @ayounsi, power cables are fine, both power supplies are green. There wasn't anyone in the cage at the time of the reboot.
[17:41:29] vgutierrez, did we do anything like that with the old puppetisation?
[17:41:47] apparently not
[17:41:53] well then don't worry about it
[17:42:03] sounds like a good task but not a blocker for this work
[17:42:42] as we're not introducing a regression on that front
[17:42:50] yeah, from some talks with bblack I understood that that was the case, but it doesn't come from LE puppetization
[17:44:19] I thought we did have some kind of icinga check on existing LEs, for expiry?
[17:44:59] right here we are
[17:45:02] modules/librenms/manifests/web.pp
[17:45:06] monitoring::service { 'https':
[17:45:10] check_command => 'check_ssl_http_letsencrypt!librenms.wikimedia.org',
[17:45:44] ah
[17:45:44] that's the file which contains the reference to letsencrypt::cert::integrated
[17:45:55] (which it seems is not the same file we are adding certcentral::cert to for some reason)
[17:46:09] but it's not direct integration, just all these hosts "happen" to also have something like that separately puppetized
[17:46:35] yes
[17:46:42] but yeah, later we should make it automagic I think
[17:47:00] the existing check_ssl_http_letsencrypt should continue to function for now where it's already defined
[17:47:13] yeah
[17:47:40] ok
[17:48:19] yeah it's just a customization of normal check_ssl with much shorter warn/crit times
[17:48:22] command_line $USER1$/check_ssl --warning 7 --critical 3 -H $HOSTADDRESS$ -p 443 --cn $ARG1$
[17:49:04] we should test it from certcentral side point of view too, just checking expiration dates of certificates in /var/lib/certcentral/live_certs
[17:49:37] I'd solve the pre-staging / clock-skew problems first, because it will probably change the shape of related things
[17:49:47] (re: "live" certs)
[17:50:09] ack
[17:50:14] netops, Operations, ops-eqiad: Fix missing PDU's for row C eqiad in netbox - https://phabricator.wikimedia.org/T208091 (Cmjohnson) Physically it was impossible to get to the s/n without removing them from the mounts. ayounsi was able to get them a different way. Asset tags ps1-c1 wmf7459 ps1-c2 wm...
[17:53:12] bblack: for the initial set of hosts I think we agreed that the 1 hour granted was enough IIRC
[17:53:34] right
[17:53:56] we have everything we need at this point for them, just a matter of rolling it out now
[17:54:04] and continuing to debug!
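Note on the check_ssl thresholds quoted at 17:48:22 (--warning 7 --critical 3): a small Python sketch of the same idea applied to an expiry date, which is roughly what a certcentral-side check over the live certs would need. It assumes the thresholds are days until expiry and that a value exactly on a threshold already triggers it; the real plugin's boundary behaviour is not shown in the log, and `expiry_state` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def expiry_state(not_after: datetime, now: datetime,
                 warning_days: int = 7, critical_days: int = 3) -> str:
    """Map a certificate's notAfter time to an Icinga-style state.

    Thresholds default to the values from the quoted command_line;
    treating them as inclusive is an assumption.
    """
    remaining = not_after - now
    if remaining <= timedelta(days=critical_days):
        return "CRITICAL"
    if remaining <= timedelta(days=warning_days):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    now = datetime(2018, 11, 19, tzinfo=timezone.utc)
    for days in (30, 5, 2):
        print(days, expiry_state(now + timedelta(days=days), now))
```

Passing `now` explicitly keeps the function testable and leaves room for the clock-skew concerns raised in the discussion to be handled by the caller.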
[17:54:20] yeah
[17:54:32] the other heavy feature bits are more about supporting deploying major certs to the cache clusters
[17:54:45] (as opposed to one-off certs for minor technical audiences)
[18:32:26] netops, Operations, ops-eqiad: Fix missing PDU's for row C eqiad in netbox - https://phabricator.wikimedia.org/T208091 (ayounsi) Open>Resolved a:Cmjohnson>ayounsi Serial exported from LibreNMS. All 8 PSUs imported in Netbox, as well as their console connections.
[22:03:37] bblack: syslog on dns2001 is full of pdns_recursor[807]: Nov 19 22:01:32 Timeout from remote TCP client 208.80.153.72 and pdns_recursor[807]: Nov 19 22:01:34 Timeout from remote TCP client 208.80.153.69 not sure if known or a possible issue
[22:06:19] when using varnishlog do you guys have a way of excluding the healthcheck/varnishcheck stuff?
[23:08:30] XioNoX: that's pybal probing for DNS health on the TCP port