[02:41:11] bblack: https://gerrit.wikimedia.org/r/#/c/182306/3 (see my last comment)
[04:15:31] Traffic, Varnish, Librarization, MediaWiki-extensions-CentralNotice, and 3 others: Split GeoIP into a new component - https://phabricator.wikimedia.org/T102848#2584958 (Krinkle)
[07:50:57] fyi, cp4013 had a somewhat broken wireshark installation consisting of a mix of packages from jessie and jessie-backports, I've removed those
[07:51:59] generally I recommend to just dump with tcpdump (which we have in standard_packages) and then analyse the dump file externally with wireshark
[07:52:40] netops, Operations: Network ACL rules to allow traffic from Analytics to Production for port 9160 - https://phabricator.wikimedia.org/T138609#2585165 (elukey) Vacation times delayed this task, sorry :) I am almost positive that these rules can be removed, but I'd like a final confirmation from @JAlleman...
[08:06:29] likewise lvs2006
[08:52:25] Wikimedia-Apache-configuration, Operations, Performance-Team, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2585291 (elukey) Summary and news: 1) Thanks to @Joe a...
[10:31:44] cp4006 looking good
[11:10:27] ema: first upload server with v4?
[11:10:51] elukey: 2nd! :)
[11:10:56] \o/
[11:11:11] oozie checks are fine, didn't see any issue so far, gooooooood
[11:11:21] (varnishkafka timeouts, etc..)
[11:12:21] nice
[11:12:47] I checked the config to make sure that the timeout was 700 seconds
[11:12:57] (plus we filter Upgrade req, etc..)
[11:31:13] Wikimedia-Apache-configuration, Operations, Performance-Team, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2585498 (hashar) Congratulations ! Might want a low pr...
[13:16:36] paravoid: nice work on the SSL check stuff :)
[13:17:18] I have one more pending
[13:17:28] to check both RSA and ECDSA
[13:17:35] but neon's Net::SSLeay is too old :(
[13:17:37] moritzm: yeah there are a couple of other cp machines that have some minor package issues too. they've been installed a long time now and we've put various packages on them as parts of experiments and readonly things :/ One in particular (I forget which now) is in a strange state with the experimental repo on a number of packages too.
[13:18:13] at some point it would be nice to roll through some refreshing-reinstalls on them just to clean up
[13:18:55] (maybe we should be doing that on some regular schedule all the time? we could call it cache reinstall monday or something and do one machine a week all the time, and they'd never live past ~4 months or so)
[13:19:47] it's sort of like a very gentle and slow chaosmonkey :)
[13:21:19] ema: looks good on cp4006 yeah. I see the same basic impact on the caching graphs as before. the fallout of persistent 2/6 on v4 in terms of vslp seems like it's probably fine for the weekend.
[13:22:28] it may be causing a slight trade of hit-local for hit-remote in codfw as a result of ulsfo->codfw vslp interference, but it's not enough to worry much about
[13:23:29] which in turn does sort of give us an automatic plan for how to go through DCs as we do them all: after all ulsfo are converted, codfw should be next, then esams, then eqiad.
[13:24:52] one interesting bit from ganglia so far, is cp400[56] have higher cpu utilization
[13:25:41] the other 4 are ~88% avg cpu idle, 56 are ~83%
[13:25:55] not awful, but it is visible
[13:27:07] the increase seems to all be in user%, so I don't think it's directly related to -sfile. probably just the logic and/or locking of core varnish has gotten more complex for the CPU with the internal front/back split, default streaming, more 304 opportunity, etc
[13:28:13] it could also be jemalloc too now that I think about it. It might be worth doing a quick depooled restart of both varnishds on one of the ulsfo v3 upload nodes today to confirm about jemalloc impact (I assume most haven't restarted for that yet)
[13:29:22] also, I think the varnishd ganglia stats are missing on all the upload v3 in ganglia for comparison (which is often the case, sadly). Will try restarting ganglia-monitor on them for comparisons over the weekend.
[13:33:07] ah maybe I was wrong about ganglia, just different stat names perhaps on some thing
[13:52:32] bblack: yes on v4 the metrics are prefixed with MAIN
[13:52:43] some are probably renamed as well
[13:54:50] the higher cpu usage seem to be due to the python programs depending on varnishlog.py
[13:56:05] huh, interesting
[13:56:40] we've seen something similar with misc T137114
[13:56:42] T137114: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114
[13:56:44] I wonder if that's intrinsic to how VSL clients operate under v4, or just inefficiency in our particular current implementation
[13:57:23] ah, do we lack grouping on the upload-specific ones?
[13:57:36] not sure, in upload we're not maxing out really
[13:58:53] and we do group by default in varnishlog4.py
[13:59:02] https://gerrit.wikimedia.org/r/#/c/293530/
[13:59:50] if not grouping:
[13:59:50] # Use request grouping by default. T137114
[13:59:51] parsed_args += ["-g", "request"]
[13:59:53] yeah
[14:00:01] T137114: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114
[14:03:29] yeah I guess there are many other inefficiencies in varnishlog4.py that we didn't really notice on maps and upload
[14:03:45] s/upload/misc/
[14:18:49] we do get a few Log overrun/Log reacquired errors
[14:23:04] Wikimedia-Apache-configuration, Operations, Performance-Team, HHVM, Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2585992 (elukey) Open>Resolved
[14:26:40] Wikimedia-Apache-configuration, Operations: Remove apache error log filters after all the migration - https://phabricator.wikimedia.org/T144005#2586001 (elukey)
[14:30:06] Wikimedia-Apache-configuration, Operations: Remove apache error log blacklist in Logstash's config - https://phabricator.wikimedia.org/T144005#2586037 (elukey)
[14:34:25] so varnishapi.py has been modified upstream https://github.com/xcir/python-varnishapi/blob/master/src/varnishapi.py
[14:34:42] I've tried the latest version on cp4006, restarting only varnishreqstat
[14:34:58] one CPU core maxed out
[14:35:14] so yeah, let's not just import the new varnishapi.py version :)
[14:38:45] worth to file a bug upstream?
[14:38:56] the guy was responsive enough IIRC
[14:39:06] yep
[14:39:58] so there's certainly a performance regression between our varnishapi.py and the latest upstream version
[14:40:57] and also clearly higher CPU usage under load between our v3 and v4 varnishlog.py stuff
[14:43:20] not bad enough to justify downgrading cp4005 and cp4006 I'd say, though
[14:46:22] +1, we can live with it
[15:53:43] Wikimedia-Apache-configuration, Operations: Remove apache error log blacklist in Logstash's config - https://phabricator.wikimedia.org/T144005#2586386 (elukey) https://gerrit.wikimedia.org/r/306943
[15:58:35] Wikimedia-Apache-configuration, Operations, Performance-Team, HHVM, and 2 others: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2586403 (elukey) As FYI we'll be able to merge the above patch only...
[16:17:39] Traffic, Operations, Patch-For-Review: Support TLS chacha20-poly1305 AEAD ciphers - https://phabricator.wikimedia.org/T131908#2586464 (BBlack) Following up a little further on the AES256 arguments: one possible counter-argument is that our current-best (and most popular by far) key exchange primitive...
[16:39:11] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2586527 (AndyRussG) Should we make a task for talking to third-party users? Announce the cha...
[16:46:03] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2586538 (BBlack) Monday morning's fine. IIRC from the meeting, the number of 3rd party wiki...
[16:53:38] Domains, Traffic, Operations, Wikimedia-Site-requests, Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2586592 (Dereckson) @Mjohnson_WMF what are your plans to pick a name and so we can move this task forward? Past experience shows the...
[16:59:03] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2586632 (AndyRussG) >>! In T143271#2586538, @BBlack wrote: > Monday morning's fine. Great,...
[17:00:59] Domains, Traffic, Operations, Wikimedia-Site-requests, Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2586639 (Mjohnson_WMF) My preference for the URL remains https://projectcom.wikimedia.org. As an abbreviated handle, projectcom match...
[17:13:21] Traffic, MediaWiki-extensions-UniversalLanguageSelector, Operations, Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2586690 (BBlack)
[17:14:21] Traffic, MediaWiki-Stakeholders-Group, Operations, Developer-notice, and 2 others: Get rid of geoiplookup service - https://phabricator.wikimedia.org/T100902#2586708 (BBlack)
[17:14:25] Traffic, MediaWiki-extensions-UniversalLanguageSelector, Operations, Patch-For-Review: ULS GeoIP should not use meta.wm.o/geoiplookup - https://phabricator.wikimedia.org/T143270#2562516 (BBlack) Open>Resolved a:BBlack Ditto the other ticket: This seems resolved with the new release y...
[17:20:02] Traffic, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Operations, and 3 others: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2586731 (BBlack) Relevant: https://wikiapiary.com/wiki/Extension:CentralNotice
[17:32:13] Domains, Traffic, Operations, Wikimedia-Site-requests, Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2586805 (Dereckson) @Platonides fine for you?
[18:05:39] Traffic, MediaWiki-General-or-Unknown, Operations, Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2586982 (Anomie) If we have anything that does multiple correlated `foo[]`-style arrays, that would be ordering dependent. For example...
[18:22:55] Traffic, MediaWiki-General-or-Unknown, Operations, Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2587043 (BBlack) >>! In T138093#2586982, @Anomie wrote: > If we have anything that does multiple correlated `foo[]`-style arrays, that...
[18:28:57] Traffic, MediaWiki-General-or-Unknown, Operations, Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2587063 (Anomie) >>! In T138093#2587043, @BBlack wrote: > So long as our sorter preserves the relative order of duplicated parameter n...
[18:34:50] Traffic, MediaWiki-General-or-Unknown, Operations, Services: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093#2587076 (BBlack) Apparently v3's libvmod-boltsort and v4's std.querysort() both fail to do so. Nothing about the handling of duplicat...
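
Note on the T138093 exchange above: the property being asked for is a stable sort keyed on the parameter name only, so that repeated names such as `foo[]` keep their relative order. A minimal Python sketch of that idea follows. It is an illustration only, not how libvmod-boltsort or std.querysort() work, and the function name sort_query_stable is hypothetical.

```python
def sort_query_stable(query):
    """Sort query parameters by name, keeping duplicate names in original order.

    Hypothetical helper for illustration; operates on the raw query string
    (the part after '?') and does not decode or re-encode anything.
    """
    # Drop empty segments produced by stray '&' characters.
    params = [p for p in query.split("&") if p]
    # Python's sorted() is guaranteed stable, so parameters that share a
    # name (the text before the first '=') stay in their original order.
    return "&".join(sorted(params, key=lambda p: p.partition("=")[0]))


if __name__ == "__main__":
    print(sort_query_stable("b=2&foo[]=x&a=1&foo[]=a"))
```

Sorting on the whole `name=value` string instead of the name alone would reorder the two `foo[]` values here, which is the ordering dependence Anomie describes.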