[08:40:11] moritzm: I'
[08:40:16] hehe :)
[08:40:31] moritzm: I've finished upgrading/rebooting all lvs hosts, spares included
[08:41:10] great, thanks
[10:23:51] merged the chacha poly rename patch (https://gerrit.wikimedia.org/r/#/c/398311/) it is now called CHACHA20-POLY1305-SHA256 in prometheus: https://grafana.wikimedia.org/dashboard/db/prometheus-tls-ciphersuite-explorer?panelId=6&fullscreen&orgId=1&from=1516005705026&to=1516011708576
[11:11:39] gilles: varnishslowlog merged, there seems to be a bug when working on request headers https://phabricator.wikimedia.org/P6585
[11:13:30] the traceback also ends up in logstash, one message per line. See program:varnishslowlog
[11:26:17] gilles: https://gerrit.wikimedia.org/r/404279
[11:49:19] ema: have you already deployed that fix?
[11:49:38] gilles: yes, same thing happens with respheader too
[11:49:52] I was prepping a similar patch unless
[11:49:59] you have objections
[11:49:59] yeah sounds fine
[11:50:09] I was just wondering why we were still seeing the same kind of output
[11:51:20] gilles: https://gerrit.wikimedia.org/r/404282
[11:51:54] HTTPS, Operations, Parsoid, VisualEditor: Parsoid, VisualEditor not working with SSL / HTTPS - https://phabricator.wikimedia.org/T178778#3900319 (Deskana)
[11:53:17] an example of empty ReqHeader that I've seen was 'ghostery-antitracking'
[11:53:24] splitagain=['ghostery-antitracking:']
[11:53:56] yeah that makes sense
[11:54:06] it's perfectly valid both ways
[11:54:36] just didn't run into that case on vagrant I guess
[11:59:43] yeah with actual production traffic it's much easier to find corner cases :)
[12:00:06] forcing a puppet run on cache hosts, let's see if there are still issues now
[12:03:59] what did you set the slowness threshold to?
[12:04:41] gilles: default, 60.0
[12:07:28] I'll temporarily set it to 1.0 on cp3043 to see if data gets sent to logstash as expected
[12:10:08] more issues it seems
[12:10:23] gilles: ?
[12:10:35] oh no wait it's the old ones
[12:10:42] wrong range, my bad
[12:10:53] gilles: so, program should always be set to 'varnishslowlog' right?
[12:11:15] I'm not seeing any logs from cp3043, while I've lowered the threshold to 1.0 so we should get something
[12:11:27] CGroup: /system.slice/varnish-frontend-slowlog.service
[12:11:27] ├─4105 python /usr/local/bin/varnishslowlog --varnishd-instance-name frontend --slow-threshold 1.0 --logstash-server=logstash.svc.eqiad.wmnet:11514
[12:11:30] └─4106 /usr/bin/varnishlog -q ReqMethod ne "PURGE" and Timestamp:Fetch[3] > 1.000000 -T 600 -n frontend
[12:11:53] and indeed running that varnishlog command I do get some output almost immediately
[12:12:25] yes, program should be varnishslowlog
[12:17:18] have you tried running the whole varnishslowlog command manually outside of systemd?
[12:17:34] yeah I think there might be something wrong with logstash.LogstashHandler
[12:17:44] running the command with stdout logger works just fine
[12:22:50] but it was sending data before
[12:22:58] when it was erroring
[12:23:23] yes, when raising exceptions it was
[12:23:55] I've now tried adding this on cp3043 right after self.logger.addHandler(handler): self.logger.info('logger test: info message')
[12:24:15] that also doesn't end up in logstash though
[12:25:56] could be the wrong format, that's bit me before
[12:26:09] logstash will ignore that silently
[12:27:51] on vagrant the configuration that works is:
[12:27:56] udp {
[12:27:56] port => <%= @port %>
[12:27:57] codec => json
[12:27:59] }
[12:28:25] there are also filters that can mess things up by rewriting fields
[12:31:19] mmh
[12:31:27] reverting the manual changes to cp3043 and going for lunch now
[12:32:12] gilles: meanwhile, let me know if you can think of anything we should try!
[12:36:10] ema: yeah I'll compare with thumbor, since that works
[12:45:20] not sure it matters, but in the thumbor config the port is an integer
[12:45:36] here I assume it ends up being a string, since it's taken from the command
[12:45:58] but that's unlikely to be the problem since the exceptions found their way just fine
[12:47:57] I think I've got it, at least I've figured out what's different with thumbor
[12:48:10] patch incoming
[12:53:57] I'm going to go pick up Aaron from the airport soon, but I think this is worth a try
[12:54:07] https://gerrit.wikimedia.org/r/#/c/404288/
[12:55:39] if I remember correctly it's because the json endpoint in production on that udp port is configured with type "logback"
[12:55:59] setting a different type means the message disappears into oblivion
[12:56:33] so I use the tags instead, and we have to avoid some overzealous host rewrite filter which is set up for elasticsearch I believe
[13:56:11] gilles: yeah that looks good. I've tried sending a test message with tags=['thumbor'] and that ends up on logstash, while tags=['varnishslowlog'] does not. Merging
[13:59:36] gilles: nice, the first message went through already. Search for 'logger_name:varnishslowlog' :)
[14:01:05] all the structured information we were expecting is there too (eg: request-Accept-Encoding and so forth)
[14:02:18] excellent
[14:03:51] on that note, I'm going to show Aaron around town, I'll only be online late tonight
[14:03:52] ttyl
[14:04:14] see you!
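The kind of manual test described above (sending one JSON message with a chosen tag to the UDP endpoint and then searching for it in logstash) could be sketched roughly as follows. This is a minimal illustration, not the code from the gerrit patch: the function names `build_message` and `send_test_message` are hypothetical, the `message` field name is an assumption, and only `tags` and `logger_name` are taken from the discussion. No `type` field is set, in line with the remark that a non-matching type makes the message disappear.

```python
import json
import socket


def build_message(text, tags, logger_name="varnishslowlog"):
    """Serialize one log event as a JSON payload for a UDP/json input.

    Field names other than 'tags' and 'logger_name' are assumptions.
    No 'type' field is set on purpose.
    """
    event = {
        "message": text,
        "logger_name": logger_name,
        "tags": list(tags),
    }
    return json.dumps(event).encode("utf-8")


def send_test_message(host, port, text, tags):
    """Send a single JSON datagram and return the bytes that were sent.

    The port is cast to int because a value taken from the command
    line arrives as a string.
    """
    payload = build_message(text, tags)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, int(port)))
    return payload
```

Calling `send_test_message("logstash.svc.eqiad.wmnet", "11514", "logger test", ["varnishslowlog"])` would send one datagram to the server named in the service command line above. Since UDP gives no delivery feedback, the only confirmation is finding the message on the logstash side.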
[17:46:03] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3901174 (fgiunchedi)
[17:48:54] Traffic, Operations, Goal, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#3901180 (fgiunchedi) p:Triage>Normal
[17:49:49] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#3901192 (fgiunchedi) Open>Resolved a:fgiunchedi
[21:39:37] netops, Operations, fundraising-tech-ops: switch network port 2/0/3 (frdb1003) back to administration-vlan - https://phabricator.wikimedia.org/T184723#3901504 (ayounsi) Open>Resolved a:ayounsi Done! ``` [edit interfaces interface-range vlan-fundraising] - member "ge-[0-1]/0/3"; [edit i...
[21:45:58] Traffic, Operations, Patch-For-Review: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3901513 (zhuyifei1999)
[21:51:43] Traffic, DNS, Operations: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3316252 (Krenair) wikibooks.wiki too - https://meta.wikimedia.org/wiki/Requests_for_comment/Domain_parking
[23:41:46] Domains, Traffic, DNS, Operations, and 2 others: en.wiki domain owned by us, but isn't hosted by us?? - https://phabricator.wikimedia.org/T167060#3901643 (Peachey88) p:Low>Triage Resetting priority for re-triage by ops on-call. Redirecting users though a random AWS account when they hit...