[02:00:24] Traffic, Operations, Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2686629 (mmodell) >>! In T112765#2509512, @BBlack wrote: > There's a little bit of refactoring work (already in-progress) to do on the Varnish side to support it "...
[02:16:38] Traffic, Operations, Phabricator: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2686660 (BBlack) >>! In T112765#2686629, @mmodell wrote: >>>! In T112765#2509512, @BBlack wrote: >> ... but even if that weren't ready in time we can use DNS hack...
[05:00:48] Traffic, Operations, Phabricator, Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2687202 (mmodell) @bblack: https://gerrit.wikimedia.org/r/#/c/313937/ is a first-attempt at puppetizing the aphlict notification service
[11:15:42] the traffic workboard looks better now :)
[11:16:06] we still have 64 tasks in the backlog to categorize
[11:17:16] context: in barcelona we started creating task categories and assigning tasks to different columns in the workboard to somehow make sense of the ~200 tasks in our backlog
[11:21:40] unfortunately it does not seem to be possible to rename columns so we're stuck with suboptimal names in some cases...
[11:22:08] "Caching Strategy" is pretty much anything varnish-related which is not v4 specific
[11:22:23] but yeah at least the workboard does make some kind of sense now
[11:23:07] ema: you can rename columns
[11:23:17] can I?
:)
[11:23:24] https://phabricator.wikimedia.org/project/board/1201/edit/6730/
[11:24:22] top right settings icon in the workboard page, then manage board, then click on a column then edit column
[11:24:44] not exactly the most obvious path :D
[11:25:28] volans: you're right, it works :)
[11:25:58] which sucks because that means we have no excuses and need to come up with decent names though :P
[11:26:01] volans: thanks!
[11:26:33] yw :)
[12:45:14] Traffic, Citoid, ContentTranslation-CXserver, MediaWiki-extensions-ContentTranslation, and 5 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001#2688238 (BBlack)
[12:46:43] I'm sorting out more workboard junk
[12:47:25] HTTPS, Traffic, MediaWiki-Email, Operations, Easy: Links in MediaWiki emails should respect the user's https preference - https://phabricator.wikimedia.org/T41676#2688250 (BBlack)
[12:48:43] Traffic, MediaWiki-extensions-ZeroPortal, Operations, Zero: Move proxy IP lists to META for Varnish XFF decoding - https://phabricator.wikimedia.org/T89838#2688259 (BBlack)
[12:52:04] HTTPS, Traffic, MediaWiki-Email, Operations, Easy: Links in MediaWiki emails should respect the user's https preference - https://phabricator.wikimedia.org/T41676#476434 (Dzahn) The "Phabricator_maintenance" user re-added the project HTTPS. Then herald re-added Traffic and Operations. This is...
[12:52:44] Traffic, Operations, RESTBase, Services, Service-Architecture: Proxying new services through RESTBase - https://phabricator.wikimedia.org/T96688#2688264 (BBlack) Clearly at least some new services are being deployed as RB-based services, and some legacy services have been converted (but a few...
[12:55:10] we should really remove the herald rule that makes #HTTPS imply #Traffic
[12:55:29] there are too many cases where something is nominally HTTPS-related but not Traffic-related
[12:55:59] (or people could accept that tags are more about teams/projects than notional broad categorization, but that seems like a losing battle, at least for #HTTPS)
[13:05:12] Traffic, Analytics-Cluster, Operations: Respect X-Forwarded-For only from trustworthy sources - https://phabricator.wikimedia.org/T56783#2688311 (BBlack) So I've linked above that we have a separate task already about moving Zero's trusted proxy lists to metawiki, for more-transparent / community man...
[13:09:07] Domains, Traffic, Operations, Wikimedia-Site-requests, Patch-For-Review: Private wiki for Project Grants Committee - https://phabricator.wikimedia.org/T143138#2688325 (BBlack)
[13:25:40] Traffic, DNS, Mail, Operations: Set up role accounts and feedback loops (FBL) with all providers - https://phabricator.wikimedia.org/T106664#2688423 (BBlack)
[13:34:14] Traffic, Operations: cache_upload should give an informative 404 rather than 403 on req.http.host != upload.wikimedia.org - https://phabricator.wikimedia.org/T118394#2688475 (BBlack) Open>Resolved a:BBlack This was fixed some time ago, independently of this ticket I guess.
[13:43:45] Traffic, MediaWiki-API, Monitoring, Operations, Services: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#2688517 (BBlack) Is there stuff left to do here beyond what's present in the current dashboards? I mean, our metrics can always be "be...
[13:43:45] FYI the varnish/prometheus config is puppetized now in beta the same as it would look in production but just with text and upload, e.g.
https://beta-prometheus.wmflabs.org/beta/graph#%5B%7B%22range_input%22%3A%226h%22%2C%22end_input%22%3A%22%22%2C%22step_input%22%3A%22%22%2C%22stacked%22%3A%22%22%2C%22expr%22%3A%22sum(rate(varnish_sma_g_bytes%5B5m%5D))%20by%20(job)%22%2C%22tab%22%3A0%7D%2C%7B%22ra
[13:43:52] nge_input%22%3A%226h%22%2C%22end_input%22%3A%22%22%2C%22step_input%22%3A%22%22%2C%22stacked%22%3A%22%22%2C%22expr%22%3A%22sum(rate(varnish_sma_g_bytes%5B5m%5D))%20by%20(service)%22%2C%22tab%22%3A0%7D%5D
[13:43:58] well, that didn't work
[13:44:21] but dashboards can be created at https://grafana-labs-admin.wikimedia.org for example
[13:55:51] Wikimedia-Apache-configuration, MediaWiki-extensions-ZeroBanner, Operations, Reading-Web-Backlog, and 3 others: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2688575 (BBlack)
[14:10:57] ok so Backlog has been renamed Triage. I figure under this temporary schema, anything that lands there needs to be moved Elsewhere
[14:11:44] and there's a few new columns: BadHerald means something like "their #HTTPS tag seems appropriate and/or unavoidable, which herald-adds #Traffic, but this task is wholly unrelated to any specific Traffic task"
[14:12:03] Watching means we're watching that ticket with some legitimate interest, but nothing to actually do (at least, presently)
[14:12:22] General is for things that don't fit in all of the obvious columns describing projects or functional areas
[14:12:47] nice
[14:13:01] arguably none of this should be organized in the columns of a single workboard like this, but it's a start to mapping things out mentally and deciding on a better model, anyways
[14:13:15] (at which point we file some meta-ticket about fixing up our tag/project/herald structure, etc)
[14:18:45] Traffic, MediaWiki-API, Monitoring, Operations, Services: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#2688672 (GWicke) @bblack, do
we have basic latency / error rate alerts set up for the API?
[14:42:10] Traffic, MediaWiki-API, Monitoring, Operations, Services: Set up action API latency / error rate metrics & alerts - https://phabricator.wikimedia.org/T123854#2688775 (BBlack) That's fair, that is what's in the title. I think I was thinking one thing and saying another above. I was thinking...
[14:46:34] Traffic, DC-Ops, Operations, ops-codfw: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#2688790 (BBlack) I guess I dropped this, sorry! @Papaul when's a good time? The work on our end is pretty trivial, we just need to be working together for a few minutes b...
[15:28:09] HTTPS, Traffic, Operations, WMF-Communications, Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2688920 (Florian) @BBlack: Even if the case is closed I would use it for reference in OTRS tickets, so this isn't the...
[15:36:43] HTTPS, Traffic, Operations, Wikimedia-Incident: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#2688956 (greg) >>! In T93927#2684299, @BBlack wrote: > Arguably, if the link from the incident to this open ticket of broader scope is annoying, it could b...
[16:05:13] so, now that we have this workboard breakdown (let's call it version 1 brainstorming...)
[16:05:31] obviously, a subproject can have column breakdowns within its own workboard, too
[16:06:16] we should probably think in terms of both long-term subproject areas, and also more-specific "projects" like Varnish4 being something else (one-off cases for major efforts)
[16:08:55] my first pass at remapping those would be: "Caching" (was "Caching Strategy") and even Varnish4, there's probably two real long-term subproject areas in there: Caching and Request Routing/Mangling (which might be implemented via cache VCL code, but that's neither here nor there in the long term)
[16:10:01] TLS and some of the General tasks (e.g. TFO), maybe go under some kind of TCP/TLS/HTTP Termination sub-task?
[16:10:21] loadbalancer+etcd go together into whatever label covers pybal/lvs stuff
[16:10:43] DNS names/infra seem like sub-columns within one subproject
[16:10:53] all the above is just thinking out loud and probably-wrong :)
[16:13:32] I guess one obvious further brainstorming step would be to break up that Caching column into Caching and ReqMangle/Route
[16:14:00] the latter being things like regex substitutions we perform on some appservices' requests, adding new backend services, inter-dc routing, etc...
[16:14:15] whereas the former is more about things that actually affect cache-correctness, purging, hitrates, etc
[16:23:24] it looks to me that the way you've organized the workboard as of now, the columns should be more sub-tags/projects of Traffic instead
[16:24:03] this is because my understanding is that the Phab workboard is basically inspired by a Kanban-like workflow in which a task moves from one column to another during its life
[16:24:24] and they are not designed for splitting the tasks in different categories
[16:24:40] but of course they can be used for that too :) Just my 2 cents
[16:28:40] HTTPS, Traffic, Operations, Wikimedia-General-or-Unknown, JavaScript: Use Upgrade Insecure Requests on Wikimedia wikis - https://phabricator.wikimedia.org/T101002#2689152 (Krinkle)
[16:28:59] Traffic, DC-Ops, Operations, ops-codfw: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#2689153 (Papaul) @BBlack let me know when is best for you. Any time from 9:30am is okay with me.
[16:36:41] volans: after careful consideration and skimming through the mediawiki.org phab workflow page we've decided that we'd rather avoid moving tasks continuously from one column to the other to reflect a state that nobody really cares about
[16:37:32] and then we've identified the endless 'backlog' column as useless, failed to create new subprojects, decided to create columns for now and then think about it again
[16:38:52] ema: you should be the one that cares about the state, what's in the queue etc...
if you don't need to be able to get the state of the things to synchronize among you then this approach is perfectly valid
[16:39:41] for priorities you can use the vertical order, moving tasks up or down
[16:39:51] right
[16:39:54] and I fully agree that an endless list of backlog is useless
[16:42:15] bblack: agreed that DNS should be a subproject with names/infra as columns
[16:43:33] something along the lines of TCP/TLS/HTTP termination (bikeshedding on the name of course) also seems like a good subproject
[16:45:49] Proto-termination? :-P
[16:47:53] Caching / Request routing & mangling could probably be one single subproject with obvious distinction between the columns
[16:48:24] in some way, caching is a type of request routing
[16:48:34] one in which the route stops before it gets to the bottom of the stack :)
[16:48:59] then it's also http termination :)
[16:49:21] heh
[16:51:07] so: worthy of whole, persistent, long-term subprojects: Traffic-DNS (infra+names), Traffic-CachingRouting (bikeshed name?), Traffic-Termination (bikeshed... but public TCP, TLS, HTTP/[12], etc), and maybe Loadbalancing (LVS/pybal/etcd-pooling, etc)
[16:51:18] something like that?
[16:51:26] EPICS
[16:51:58] oh no I guess not ;)
[16:52:10] [EPIC] Create the one true termination, caching, and routing configuration that's so general and magical that we never have to work on it again and it solves all future needs
[16:52:34] i think riccardo should automate that yeah
[16:52:47] lol, as long as we don't tell Alex (for EPIC) ;)
[16:53:44] I like the general idea of subprojects being Traffic-Foo
[16:54:02] otherwise we get into the same mess we have today where there's generic tags like #DNS #Domains #HTTPS
[16:54:31] we can declare policy all we want about how tags work, but people will stick those on all kinds of things unrelated to what we're doing, and they're still legitimate "tags" in some generic sense...
[16:55:15] there's a ton of "HTTPS" issues in the MediaWiki realm that we have little involvement in, but they're probably not going to make a #MediaWiki-HTTPS-Issues either
[16:55:17] agree, although I never tried subprojects on Phab, to see if they do some "magic" about that
[16:55:35] and right now #HTTPS implies #Traffic implies #Operations
[16:57:02] ditto for #DNS or #Domains getting tagged on tickets like "create a new wiki for foo", when we're really only involved in some subtask of that once they're done bikeshedding names and the meaning of life, the universe, and everything.
[16:58:44] netops, Operations, Ops-Access-Requests: Access to network devices - https://phabricator.wikimedia.org/T147061#2689222 (RobH) a:faidon My understanding is that all access to the switches is currently handed by @faidon. As such, I'm escalating this task to him as part of my clinic duty week, and...
[16:59:04] netops, Operations, Ops-Access-Requests: elukey - Access to network devices - https://phabricator.wikimedia.org/T147061#2679684 (RobH)
[17:30:29] ema: can you sanity-check the thinking in https://gerrit.wikimedia.org/r/#/c/313847/ ?
[17:37:28] bblack: sgtm
[17:38:40] I put the backends on weekly restart instead of daily yesterday, so probably should be clear to restart frontends relatively-quickly again. I'll probably push it out later today and do some staggered FE restarts.
[17:39:14] we'll see how the weekly thing goes. should be fine at least for the first few days, but starting thursday or so will have to monitor mail backlogs more to verify
[17:39:29] s/mail/varnishd mailbox/ :)
[17:40:34] maybe we get lucky and the new jemalloc setting reduces FE mem consumption and we can bump the percentage back up from 40% too, but no rush on any of that
[17:41:26] really only ulsfo was having issues at 50%, with its smaller total ram size.
probably the "right" answer there is to include some constant in the calculation to account for nginx, backend fscache pressure, rest of system, etc
[17:41:58] e.g. "(total_ram - xxG) * 0.60", instead of "total_ram * 0.5" or whatever
[17:57:18] Traffic, Operations: OpenSSL 1.1 deployment for cache clusters - https://phabricator.wikimedia.org/T144523#2689458 (BBlack) The chapoly prehack and +3des stuff are in the first 3 commits here and should rebase fine as they are: https://phabricator.wikimedia.org/diffusion/ODFP/history/wmf-1.1/ . Note the...
[21:56:37] one of the eqiad frontends has been up for ~15 minutes now and is only using 10G of memory for FE cache heh
[21:56:47] I think this is the first full restart since the 256KB frontend cutoff, too
[21:56:58] so we wouldn't have seen an effect primarily from that
[21:57:11] will be interesting to see what they rise to after a day or a week
[21:57:40] (if they can't max out the available malloc space in a week, we should probably raise the 256KB cutoff just to catch more trailing-edge percentage)
[22:37:49] after an hour, ~24G
[22:37:59] so yeah, it might still max out its storage before 24h is up
[22:38:50] in completely unrelated news: dev.ssllabs.com is now showing Safari 10 stuff, as well as Chrome and FF on WinXP
[22:39:15] Safari 10 for OSX supports chapoly, but iOS doesn't (seems backwards, maybe they'll fix it down the road?)
[22:39:41] or it's possible that most of their iOS devices new enough for iOS 10 are new enough to have AES in their ARM cores, who knows
[22:40:01] and Chrome/FF-for-XP just shows what we expected: they support modern strong ciphers in both cases
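
The frontend cache sizing idea from the 17:41 messages, "(total_ram - xxG) * 0.60" instead of "total_ram * 0.5", could be sketched roughly as below. This is only an illustration, not the actual puppet logic: the function names are hypothetical, the reserve stays a parameter because the log leaves it as "xxG", and the host sizes in the demo are made-up round numbers rather than real cache host specs.

```python
GIB = 1024 ** 3


def flat_cache_bytes(total_ram_bytes: int, percent: int) -> int:
    """Flat rule: the cache gets a fixed percentage of all RAM."""
    return total_ram_bytes * percent // 100


def reserved_cache_bytes(total_ram_bytes: int, reserved_bytes: int, percent: int) -> int:
    """Reserve-first rule: subtract a constant for everything else on the
    host (nginx, backend fscache pressure, rest of system), then give the
    cache a percentage of what is left."""
    if reserved_bytes < 0 or not 0 <= percent <= 100:
        raise ValueError("reserve must be >= 0 and percent within 0..100")
    usable = max(total_ram_bytes - reserved_bytes, 0)
    return usable * percent // 100


if __name__ == "__main__":
    # Hypothetical host sizes and reserve, chosen only to show the shape
    # of the two rules; none of these values come from the log.
    reserve = 20 * GIB
    for total in (100 * GIB, 200 * GIB):
        flat = flat_cache_bytes(total, 50)
        adjusted = reserved_cache_bytes(total, reserve, 60)
        print(f"{total // GIB}G host: flat={flat // GIB}G reserve-first={adjusted // GIB}G")
```

With a constant reserve, a smaller host gives up a larger share of its RAM to the non-cache workload, which matches the observation that only the smaller-RAM site had trouble under a flat percentage.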