[09:26:52] cp3030 upgraded to latest libssl1.1
[09:27:17] we're currently running it on cp2002, cp4008 and cp3030
[11:29:58] Domains, Traffic, Operations: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (Kaarel_Vaidla)
[11:52:15] comments welcome on https://gerrit.wikimedia.org/r/#/c/338953/, the idea is to allow varnish probes for applayer backends too
[11:54:32] even with one backend only, probes can be good to 1) implement grace in case of sick backend 2) return the 50x faster 3) inspect backend health with varnishlog
[12:08:11] Traffic, Operations, Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366#3042597 (kruusamagi) I removed the date info from the main page of Estonian Wikipedia, but it only helps to hide the issue and not to solve it (the weekl...
[12:53:33] we've had another 500 spike between 2017-02-21T11:22:49 and 2017-02-21T11:22:54
[12:53:42] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen&var-site=All&var-cache_type=text&var-status_type=5&from=1487674025998&to=1487677267264
[12:55:01] cp3042 was the host affected this time (still on the old openssl, so it seems yesterday's errors happened on cp4008 just by chance)
[12:57:47] and the errors are all caused by the same source IP
[13:19:58] ema: same IP of yesterday?
[13:23:30] volans: nope, same IP causing all the errors in the above timeframe
[13:24:44] ok, would have been someone playing with it if it was the same going to ulsfo one day and esams the next :D
[13:25:11] indeed, someone circumventing geodns
[13:25:17] but no, that's not the case :)
[14:09:01] netops, Operations: cr2-knams<->asw-esams GBLX fiber down - https://phabricator.wikimedia.org/T158647#3042814 (faidon)
[14:22:11] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3042853 (Zppix) >>! In T158490#3039427, @Platonides wrote: > MW is the country-code of Malawi (ISO 3166-1), so I find unlikely we would be...
[14:22:29] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3042854 (Zppix) >>! In T158490#3039567, @Matthewrbowker wrote: >>>! In T158490#3039326, @Zppix wrote: >> @Aklapper I meant like if abbrev'd...
[14:58:07] meanwhile, the mailbox expiry on cp1074 has been growing for the past 20 minutes
[15:05:52] same machine that gave issues on Friday as it turns out
[15:06:09] and on Wednesday
[15:07:05] and sure enough varnish-be has now been up for two days
[15:08:23] so yeah, if left on its own (no restarts), cp1074 seems to be affected by mailbox lag roughly every two days
[15:10:40] the number of cached objects also goes up in a weird way roughly at the same time as the mbox lag increase
[15:15:31] still no 503s
[15:17:37] heh https://www.varnish-cache.org/lists/pipermail/varnish-misc/2013-August/023289.html
[15:18:10] > Our best guess based on the data supplied is that the tiny object handling in
[15:18:13] the file backend is the source of this. Many small objects mixed with
[15:18:15] larger objects makes a bad combination on -sfile.
[15:26:30] ok so it seems to make sense that the number of cached objects goes up while the expiry thread is lagging behind, but then why isn't the nuke rate decreasing?
[15:43:11] I've added a graph for allocator failures https://grafana-admin.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=22&fullscreen&from=now-7d&to=now
[15:54:22] uh yeah, the object expiration rate of course is the one that we should expect to see going down
[15:54:47] and indeed it's been 0 since 14:32ish
[16:05:31] still no 503s, perhaps a better alert would be when exp_mailed - exp_received gets close to n_objecthead?
[16:06:07] I've updated the graph to plot both the lag and n_objecthead meanwhile https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=21&fullscreen
[16:18:08] mmh, no at least in cp1067's case we never even approached n_objecthead and still returned errors https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=21&fullscreen&from=1486926014089&to=1487244671042&var-server=cp1067&var-datasource=eqiad%20prometheus%2Fops
[16:21:05] oh and cp1067 is _text, not _upload!
[16:24:45] anyways, we haven't discovered much new: the file storage backend doesn't work for our workload
[16:34:23] the only news is that we're not alone (https://www.varnish-cache.org/lists/pipermail/varnish-misc/2013-August/023289.html), something which I don't think we were aware of
[16:39:22] moritzm: I got sidetracked :) upgrading openssl on maps and misc now
[16:40:14] ok!
[16:42:45] Traffic, Analytics-Kanban, Operations, Patch-For-Review, User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043356 (elukey) Checked the oxygen logs and the following UA is the only one getting 503s during the past 21 days: ```244268 "Wikipedia/10...
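
Editor's sketch of the alert idea floated at 16:05:31: compare the expiry mailbox lag (exp_mailed - exp_received) against n_objecthead. This is a minimal illustration, not what the team deployed. The function names and the 0.8 ratio are hypothetical, and the JSON layouts handled by `counters_from_json` are an assumption about `varnishstat -j` output that should be checked against the Varnish version in use.

```python
import json


def counters_from_json(text):
    """Flatten varnishstat-style JSON into {counter_name: value}.

    Assumption: counters are either top-level keys or nested under a
    "counters" key, each with a "value" field. Other entries are skipped.
    """
    data = json.loads(text)
    raw = data.get("counters", data)
    return {
        name: entry["value"]
        for name, entry in raw.items()
        if isinstance(entry, dict) and "value" in entry
    }


def mailbox_lag(counters):
    """Objects mailed to the expiry thread but not yet received by it."""
    return counters["MAIN.exp_mailed"] - counters["MAIN.exp_received"]


def lag_is_alarming(counters, ratio=0.8):
    """True when the lag reaches `ratio` of n_objecthead.

    The ratio is a made-up threshold for illustration only.
    """
    return mailbox_lag(counters) >= ratio * counters["MAIN.n_objecthead"]


if __name__ == "__main__":
    sample = json.dumps({
        "timestamp": "2017-02-21T16:05:31",
        "MAIN.exp_mailed": {"value": 1000},
        "MAIN.exp_received": {"value": 900},
        "MAIN.n_objecthead": {"value": 120},
    })
    counters = counters_from_json(sample)
    print(mailbox_lag(counters), lag_is_alarming(counters))
```

As noted at 16:18:08, cp1067 returned errors without the lag ever approaching n_objecthead, so a check like this would at best be one signal among several.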
[16:52:11] Traffic, Analytics-Kanban, Operations, Patch-For-Review, User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043475 (Milimetric) Ping @Fjalapeno this UA is the iOS app, right? Any help you can provide Luca in finding out why we might be seeing 503...
[16:55:23] Traffic, Analytics-Kanban, Operations, Patch-For-Review, User-Elukey: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043511 (elukey) Just adding a note: I am seeing also others similar UA, that follows the same pattern.. but nothing else. I suspect that I...
[17:06:35] and after hours of struggle, cp1074's expiry thread just managed to catch up without depooling/restarting
[17:10:05] Traffic, Analytics-Kanban, Operations, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043573 (Fjalapeno)
[17:11:36] ema: any idea if we'd expect this to get better or worse with shorter TTLs on objects?
[17:11:55] I would guess better (less eviction contention) but not sure
[17:13:51] Traffic, Analytics-Kanban, Operations, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043580 (Fjalapeno) @Milimetric having @joewalsh verify this for you
[17:19:01] bblack: I'm not sure, perhaps we should also plot LCK.exp.locks and see how that changes with shorter TTLs?
[17:21:02] anyways the object expiration rate on upload backends is very low
[17:45:19] Domains, Traffic, Operations: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3042523 (Reedy) https://github.com/wikimedia/operations-dns/blob/master/templates/wikimedia.ee If you follow "Add a record to your domain settings (Recommended)", and provide t...
[17:48:05] bblack: I've added expiry lock operations at the bottom of https://grafana-admin.wikimedia.org/dashboard/db/varnish-machine-stats
[17:48:53] after recovering from the lag they seem to be higher than before?
[17:53:48] plan for tomorrow: finish the openssl upgrades (text/upload), perhaps start with upgrading varnish to 4.1.5
[17:53:51] o/
[17:54:02] sounds awesome :)
[17:54:24] ema: I looked at your probe patch too. it's good on its own, but
[17:54:47] ema: I wanted to get rid of those merge of defaults hashes, so the data structures can be moved to hieradata
[17:55:04] maybe can work that in as well, and just put the probe entry with the backend definition itself
[17:56:09] sure thing!
[17:57:30] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (CRoslof) One- and two-character .org domain names aren't available for general registration. See, for example, this press release...
[18:00:06] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (Dzahn) Even if we would be able to get it and wanted to use it, it would still be blocked on T133548.
[18:04:14] Domains, Traffic, Operations: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3043748 (Reedy) Oh, and if you're wanting to use Google Apps like that.. I suspect your mail server MX records will need updating - https://github.com/wikimedia/operations-dns/b...
[18:04:42] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3038649 (demon) Heh, I had this idea like 5 **years** ago but never felt like bothering to follow-up on it.
Plus T133548
[18:12:03] Traffic, Operations, Wikimedia-Mailing-lists: convert lists.wikimedia.org certificate to LetsEncrypt (deadline:2017-03-02) - https://phabricator.wikimedia.org/T154917#3043827 (RobH) p:Triage>High a:RobH>BBlack I'm just not getting through this fast enough, so I'm reassigning this to B...
[18:15:16] HTTPS, Traffic, Operations, Wikimedia-Shop: store.wikimedia.org HTTPS issues - https://phabricator.wikimedia.org/T128559#3043838 (Aklapper) @Jseddon / @MBeat33: Any news?
[18:23:20] Traffic, Analytics-Kanban, Operations, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043886 (Fjalapeno) @Milimetric @elukey verified that this is the iOS app
[18:23:38] HTTPS, Traffic, Operations, fundraising-tech-ops: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043892 (Jgreen)
[18:24:16] HTTPS, Traffic, Operations, fundraising-tech-ops: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043908 (Jgreen) @EWilfong_WMF are you the right point of contact for Trilogy for this?
[18:26:59] HTTPS, Traffic, Operations, fundraising-tech-ops: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3043925 (RobH) Please note that some potential details for this are also on private task T156849. However, relevant info has been...
[18:43:21] Traffic, Analytics-Kanban, Operations, Wikipedia-iOS-App-Backlog, and 2 others: Periodic 500s from piwik.wikimedia.org - https://phabricator.wikimedia.org/T154558#3043971 (JoeWalsh) @Milimetric this UA is from the iOS app. In testing locally, I didn't see any 503s. A potential cause of the surge...
[20:23:47] HTTPS, Traffic, Operations, fundraising-tech-ops: update SSL certificate for benefactorevents.wikimedia.org by 2017-03-02 - https://phabricator.wikimedia.org/T158684#3044424 (EWilfong_WMF) @Jgreen Yes, I will be the point of contact for this update. This domain is hosted using Azure's App Servic...
[22:47:13] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044907 (Zppix) With the information @CRoslof provided I'm going to consider this task denied? Anyone disagree?
[22:55:04] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044918 (Matthewrbowker) >>! In T158490#3042854, @Zppix wrote: >>>! In T158490#3039567, @Matthewrbowker wrote: >>>>! In T158490#3039326, @Z...
[22:56:33] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044919 (MaxSem) Open>declined
[23:07:15] netops, Operations, ops-codfw: codfw:ms-be2028-ms-be2039 switch port configuration - https://phabricator.wikimedia.org/T158714#3044966 (Papaul)
[23:14:09] Domains, Traffic, Operations, Wikimedia-Site-requests: Consider mw.org being added as a redirect to mediawiki.org - https://phabricator.wikimedia.org/T158490#3044987 (Dzahn) Having multiple URLs for the same content is also bad for "SEO" and we already have w.wiki as a generic URL shortener.
[23:34:51] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3045037 (GWicke)
[23:43:12] Traffic, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3045070 (GWicke) >>!
In T66214#2981357, @Gilles wrote: > Something that's missing in the current plan, however, is the swift sharding information that i...
[23:54:39] netops, Operations, ops-codfw: codfw:ms-be2028-ms-be2039 switch port configuration - https://phabricator.wikimedia.org/T158714#3045077 (RobH) Open>Resolved all ports have been enabled, had descriptions set, and placed in the private vlan for their respective rows.