[07:03:34] <_joe_> FYI, I just saw on icinga cp1062 and cp4026 getting connection refused on port 3128
[07:04:44] could be the twice-a-week restart?
[07:05:13] <_joe_> maybe, but it might be worth checking
[07:05:24] <_joe_> I don't have time to do it now, just noticed on icinga
[07:22:05] indeed, varnish was restarted at 07:02 UTC in cp1062
[07:23:09] same for cp4026
[07:25:23] yo
[08:31:36] <_joe_> vgutierrez: yeah it was peculiar to see two servers at the same time
[09:42:45] Traffic, Operations, Pybal, Patch-For-Review: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4079882 (Vgutierrez) pybal metrics package attempts to make prometheus support optional, but it's obviously failing right now. The subm...
[10:05:51] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4079954 (BBlack) >>! In T189252#4079025, @Liuxinyu970226 wrote: > What about Antarctica (AQ)? Not in scope here....
[13:41:39] cp1008 has a broken package state: "The following packages have unmet dependencies: varnish-dbg : Depends: varnish (= 5.1.3-1wm6) but 5.1.3-1wm7 is installed"
[13:41:58] ok to fix (since that breaks updates of other packages) or does that mess with any manual experiment?
[13:42:17] bblack & ema were playing with cp1008 :)
[13:44:28] so maybe they're trying some patches included in -wm7
[13:55:23] bblack, ema: so I've two patches to enable a new public service on cache-misc and I'd like to know if A) they are correct and/or anything else is missing, B) I can go ahead and merge the first one and test it before merging the DNS one
[13:55:27] director patch: https://gerrit.wikimedia.org/r/#/c/419763/
[13:55:34] DNS patch: https://gerrit.wikimedia.org/r/#/c/419800/
[13:57:07] volans: it's active/active? can send queries to either backend randomly?
[13:57:37] yes, it's stateless and queries puppetdb locally in each DC
[13:57:52] so I don't see any reason to not have it active/active
[13:58:17] ok
[13:59:47] volans: seems sane. usually the harder part of such a thing is building a new internal LVS service, but in this case it's just two simple servers :)
[14:00:18] yeah and in each DC we have just one, so chatting with go.dog we didn't see any reason to add an LVS entry
[14:00:25] to balance just one server
[14:00:27] volans: do the director thing and puppetize on all cache_misc, before doing the DNS thing, is usually best
[14:01:08] ah I see you already noted that in your self-reviews
[14:01:16] yeah, so far I've tested with ssh tunnel, I was thinking to merge the director and test it with hosts file pointing at the misc IP
[14:01:36] right
[14:01:55] thanks for the review!
[14:02:09] in this case since "pass" it may not matter either way. usually the most important reason to do cache_misc before DNS is to avoid caching 404s for the domain
[14:02:37] right
[14:02:50] err, I guess that still matters in this case, since the "pass" wouldn't be there yet if done backwards :)
[14:03:08] eheheh also true :)
[14:03:56] about the "pass", we thought it was the right one given is a service behind LDAP auth and the only thing cachable for real would be static files, and not sure if worth anyway
[14:04:01] if you have a better suggestion
[14:04:05] let us know ;)
[14:05:04] if a service is authenticated it's tricky to do anything but "pass" and not mix up cached contents between unauth/auth/different-users
[14:05:23] (without putting in custom per-service VCL to handle the auth mechanism/cookies/whatever)
[14:05:41] for low-volume cache_misc stuff we don't really care about cache efficiency anyways, just standardization
[14:05:46] (of the entrypoint)
[14:06:14] agree, glad we picked the right one :)
[14:11:24] * volans surprised on the amount of diff in the compiler for just this small addition, it triggers a lot of things :)
[14:15:19] Traffic, Operations: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202#4080675 (BBlack) Since the last stats update ~6 months ago above, the overall percentage for AES128-SHA has continued its decline, from ~0.220% to ~0.0846% . We'll be looking to plan and start an...
[14:42:37] bah I mentally lost track of the cp1008 packaging thing somewhere
[14:43:29] moritzm: fixed
[14:48:35] ok, thanks
[14:58:58] presentation went well, nobody fell asleep and I got a few good questions http://www.linux.it/~ema/slides/WMF_Traffic_DIBRIS.pdf
[14:59:13] also filed https://phabricator.wikimedia.org/T190693 to extend our monitoring for this, BTW
[15:03:52] lol \o/
[15:04:09] best part: 'OMG uncacheable why did we wait?'
[15:04:30] :)
[15:08:30] ema: congrats :D
[15:17:58] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4080917 (ayounsi) >>! In T189252#4079025, @Liuxinyu970226 wrote: > What about Antarctica (AQ)? In addition, Antarc...
[15:39:12] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4080943 (Vgutierrez) Current status: * varnishxcps **ready to be removed** (https://gerrit.wikimedia.org/r/421338) * varnishxcache is currently being u...
[16:41:31] netops, Operations: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4081276 (jcrespo)
[16:45:33] netops, Operations: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4071230 (jcrespo) I've clarified the database/backups provisioning service, so that it can comfortably recover in an emergency multiple databases at the same time, in case of catastrophic failure to reduce TTR, but also...
[16:58:51] mark, I've found this while debugging the prometheus-client issue, https://github.com/wikimedia/PyBal/blob/master/pybal/__init__.py#L7
[16:59:12] basically that is forcing pybal to use test code in the production environment
[16:59:20] and it's not good at all
[16:59:49] what does that mean in practice, I'm at a loss in python-land sometimes
[16:59:57] i don't know either
[16:59:58] like, slowdown everything for extra assertions sort of thing?)
[17:00:06] it included the __init__.py in test/
[17:00:11] *includes
[17:00:18] that does a bunch of import *
[17:00:37] basically an issue on a test could break pybal on start-up
[17:00:43] ok
[17:00:53] so pybal actually runs all tests on startup?
[17:00:56] nope
[17:00:59] but once all the useless test stuff is imported, shouldn't have much runtime cost right?
[17:01:00] compiles all tests
[17:01:01] just an import
[17:01:01] must be a syntax error though
[17:01:02] right
[17:01:06] yeah that's not ideal
[17:01:49] also there could be some name clashing/masking potentially
[17:02:08] Wikimedia-Apache-configuration, Patch-For-Review, Release-Engineering-Team (Kanban): Cleanup remaining WikipediaMobileFirefoxOS references - https://phabricator.wikimedia.org/T187850#4081406 (jcrespo) > I don't think we use mediawiki-config for anything We use them to run check_private_data.py, whic...
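The start-up risk discussed above (a package `__init__.py` that star-imports its own test package) can be reproduced in isolation. This is a minimal sketch, not PyBal's actual code: `demo_pkg`, its file contents, and the helper `import_fails_with` are all hypothetical names made up for the illustration. It only shows that an import of the top-level package compiles and runs the test package too, so a syntax error there surfaces when the package itself is imported.

```python
import importlib
import pathlib
import sys
import tempfile


def import_fails_with(pkg_init_source, test_init_source):
    """Build a throwaway package 'demo_pkg' (hypothetical name) on disk,
    import it, and return the exception type name, or None on success."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = pathlib.Path(tmp) / "demo_pkg"
        (pkg / "test").mkdir(parents=True)
        (pkg / "__init__.py").write_text(pkg_init_source)
        (pkg / "test" / "__init__.py").write_text(test_init_source)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_pkg")
            return None
        except Exception as exc:  # SyntaxError, ImportError, NameError, ...
            return type(exc).__name__
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules
                         if n == "demo_pkg" or n.startswith("demo_pkg.")]:
                del sys.modules[name]


# The package __init__ pulls in its test package, as described in the chat.
PKG_INIT = "from .test import *\n"

print(import_fails_with(PKG_INIT, "x = 1\n"))          # healthy test package
print(import_fails_with(PKG_INIT, "def broken(:\n"))   # broken test package
```

In this sketch no test is ever executed; the failure comes purely from the import, which matches the "just an import" / "compiles all tests" exchange above.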
[17:02:27] yes
[17:02:42] the real problem would be if any of those modules are doing anything at import time
[17:02:46] i've also fixed some bugs in existing test code that unilaterally replaced pybal classes with test classes
[17:03:01] was an odd problem to debug
[17:03:33] monkey-patching is possible in tests, yeah
[17:03:35] mark: pybal testcases are pretty intrusive and interfere between each other
[17:03:42] of course, it could be at this point it breaks without the import :)
[17:04:05] they lack a lot of proper tearDown() that restores pybal state
[17:04:11] well
[17:04:22] should rarely need that in the first place
[17:05:47] anyway i'd be surprised if pybal really breaks without the test import, the tests might though ;)
[17:06:18] git blame says the line is there to make travis/coveralls happy
[17:06:42] i don't know anything about testing infrastructure, i was surprised last week to see a problem affecting travis/coveralls but not locally
[17:06:56] i have no idea how that stuff works (nor a desire to find out ;)
[17:07:04] in my experience this happens for different version of dependencies
[17:07:23] in this case it was because I had a class attribute starting with 'test' (so testAttr) which wasn't a test method
[17:07:27] so travis was trying to execute it
[17:07:29] locally it wasn't
[17:07:38] probably a newer version which caught that it wasn't executable or something
[17:10:20] callable() probably
[17:10:49] Domains, Traffic, Wikimedia-Apache-configuration, Operations: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4081459 (Dzahn)
[17:10:55] HTTPS, Traffic, Operations, Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#4081460 (Dzahn)
[17:11:59] mark: easy thing to test
[17:12:17] add a --upgrade in the pip lines in .travis.yml
[17:12:26] 19:06:56 i have no idea how that stuff works (nor a desire to find out ;)
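The `testAttr` / `callable()` point above can be illustrated with the standard library alone. This is a sketch under assumptions: `ExampleTest` and the two helper functions are hypothetical, and it says nothing about which runner or version Travis actually used. It only contrasts a discovery rule that trusts the name prefix with one that also checks `callable()`, which is what Python 3's `unittest.TestLoader.getTestCaseNames` does.

```python
import unittest


class ExampleTest(unittest.TestCase):
    # Hypothetical: a plain data attribute whose name happens to start
    # with "test", like the testAttr mentioned in the chat.
    testAttr = 42

    def test_real(self):
        self.assertEqual(self.testAttr, 42)


def names_by_prefix_only(cls, prefix="test"):
    """Naive discovery: anything defined on the class with the prefix."""
    return sorted(n for n in vars(cls) if n.startswith(prefix))


def names_by_prefix_and_callable(cls, prefix="test"):
    """Stricter discovery: the attribute must also be callable."""
    return sorted(
        n for n in vars(cls)
        if n.startswith(prefix) and callable(getattr(cls, n))
    )


print(names_by_prefix_only(ExampleTest))          # includes testAttr
print(names_by_prefix_and_callable(ExampleTest))  # only test_real
print(list(unittest.TestLoader().getTestCaseNames(ExampleTest)))
```

A runner using the first rule would try to execute `testAttr` as a test and fail; renaming the attribute so it does not start with `test` sidesteps the difference between runners.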
[17:12:31] i have better things to do with my limited tech work time ;)
[17:13:00] Domains, Traffic, Wikimedia-Apache-configuration, Operations: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4067490 (Dzahn) This is blocked on T133548 (technical) and T101048 (policy). Yes, it's not just this one domain name. Though one of the tickets is about deciding...
[17:24:56] Traffic, Operations, Performance: Resources and pages occasionally take seconds to respond or fail - https://phabricator.wikimedia.org/T189085#4081498 (BBlack)
[17:24:59] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4081495 (BBlack)
[17:25:04] Traffic, Operations, Patch-For-Review: varnish-be: rate of accepted sessions keeps on increasing - https://phabricator.wikimedia.org/T189892#4081499 (BBlack)
[17:26:55] Traffic, Operations, ops-codfw: cp2006, cp2010: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4076372 (BBlack) Depooled both today, we should do that in general as these arise.
[17:28:10] Traffic, Operations, Patch-For-Review: varnish: discard cold vcl - https://phabricator.wikimedia.org/T187778#4081531 (BBlack) Open>Resolved a:BBlack This was fixed in https://gerrit.wikimedia.org/r/#/c/420432/ about the broader issues (which we still have, because some VCLs never go cold,...
[17:33:31] Traffic, Operations, Patch-For-Review: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#4081555 (BBlack) > The 'varnish mailbox lag' icinga alerts as implemented in the parent task have been going CRITICAL for a while and in some cases result in 503s spikes...
[17:34:01] Traffic, Operations, Patch-For-Review: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#4081560 (BBlack)
[17:34:05] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4081558 (BBlack)
[17:35:53] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4081572 (Vgutierrez) Blocked (still in use) varnish cachestats daemons: * varnishstatsd * varnishrls * varnishreqstats Folks, we need your help to mov...
[17:36:10] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Add Prometheus client support for varnish/statsd metrics daemons - https://phabricator.wikimedia.org/T177199#4081575 (Vgutierrez)
[17:36:12] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4081574 (Vgutierrez) Open>stalled
[17:59:09] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Varnish HTTP response from app servers taking 160s (only 0.031s inside Apache) - https://phabricator.wikimedia.org/T181315#4081674 (BBlack) I've pulled together a few other related open tasks that belong here. It seems fairly certa...
[18:03:24] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4081704 (Pchelolo) - API Summary dashboard - this one uses `varnish.$dc.backends.be_{backend}` metric and relies on the actual backend being a part of...
[18:37:52] netops, Operations: Security audit for tftp on install1001 - https://phabricator.wikimedia.org/T122210#4081830 (ayounsi) Things must have changed since 2015. @Dzahn @Andrew Running `install1002:~$ sudo iptables -L -v -n | grep "udp dpt:69"` I see plenty of ACLs. I think next step is to audit that list,...
[18:44:35] Traffic, Operations, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4081850 (Ottomata) > Reqstats-otto dashboard I haven't looked at that dashboard in years, and I doubt anyone else has either. Deleted. :)
[19:02:51] netops, Operations: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4081897 (RobH) p:Triage>Normal I'm simply trying to reduce our number of 'needs triage' tasks in #operations. This seems to be an issue that is either normal, or higher priority. Due to the timeline of the 10G...
[19:05:01] Domains, Traffic, Wikimedia-Apache-configuration, Operations: en-wp.org certificate error - https://phabricator.wikimedia.org/T190244#4081905 (RobH) p:Triage>Normal As SRE Clinic Duty person this week, I'm setting this to normal priority. The items blocking it are also normal priority, a...
[19:13:39] Traffic, Operations: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4081937 (Ragesoss) The behavior change last week was not limited to JSON pages. My app uses the mediawiki ruby API gem's `get_wikitext` [[https://github.c...
[19:31:21] Traffic, Operations, Pybal, Patch-For-Review: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4082022 (RobH) p:Triage>Low Setting this to normal priority as part of SRE clinic duty. After IRC discussion with @jgreen, it i...
[19:48:41] Traffic, DNS, Mail, Operations, Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4029896 (RobH) >>! In T189065#4036314, @gerritbot wrote: > Change 417350 had a related patch set uploaded (by Herron; owner: Herron): > [operations/dns@master]...
[20:07:19] Traffic, DNS, Mail, Operations, Patch-For-Review: Outbound mail from Greenhouse is broken - https://phabricator.wikimedia.org/T189065#4082180 (herron) > Is there any reason we cannot merge this in advance of the greenhouse.io settings change? The change is immature as-is, unfortunately. Bef...
[20:22:59] netops, Operations: Security audit for tftp on install1001 - https://phabricator.wikimedia.org/T122210#4082253 (Dzahn) >Andrew wrote: >> from within the labs-vm subnet >ayounsi wrote: >> I think next step is to audit that list, especially for "higher risks" ranges, like Cloud or Sandbox. I think this i...
[20:23:07] Wikimedia-Apache-configuration, Operations, Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4082254 (Imarlier)