[09:29:25] 10Traffic, 10DNS, 06Discovery, 06Labs, and 3 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3133149 (10grin) >>! In T161256#3128643, @MaxSem wrote: > Now, in the time of HTTP/2.0 over TLS, there are modern pipelining t...
[09:56:15] 10Traffic, 10Analytics, 06Operations, 06Reading-Web-Backlog: mobile-safari has very few internally-referred pageviews - https://phabricator.wikimedia.org/T148780#2732531 (10Nemo_bis) >>! In T148780#2891117, @mforns wrote: > Until today, there are certain browser versions that are not populating the referre...
[12:39:04] 10Traffic, 06Operations, 06Performance-Team, 06Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3133748 (10Gilles) I take issue with the idea of delaying display of data once it's available, through animation or otherwise....
[14:57:36] hey bblack, do you have a sec for a quick varnish-switchDC-related question? :)
[15:15:44] volans: sure
[15:20:33] it's exactly about the same topic you were discussing with j.oe, from the other point of view :) what is the likelihood that for the switchover we'll no longer need this to be merged? https://gerrit.wikimedia.org/r/#/c/284400/3/hieradata/common/cache/text.yaml
[15:21:36] bblack: and also, to be sure, what else might be needed for the switchover automation regarding this step
[15:24:58] volans: I can't really speak to the likelihood yet, I think. It's kind of hanging in the balance; depends on whether the patch gets merged or not.
[15:26:08] ok, the switchover automation already has a step to enable and run puppet on those hosts in case it's not merged, so no problem on that side
[15:26:13] volans: what's the overall plan you're looking at for cache-level instructions for apr19? same list as before on wikitech?
[15:26:47] bblack: yes, it was the change from https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Phase_5_-_apply_configuration point 5
[15:27:57] so the current switchdc automation assumes that we merge that before the RO period, and the automation will take care of enabling puppet and running it on Role::Cache::Text hosts
[15:28:02] that link seems odd there, seems more like a switch-back link (it's replacing codfw with eqiad)
[15:28:13] yes, it was for the switchback
[15:28:48] so in case the change is not merged, I think we're covered
[15:29:18] while if it lands in prod in time, then just a change of entries in etcd/discovery will be needed?
[15:29:44] I'd like to have both cases covered by the automation, given that we're not sure if we'll have it or not
[15:31:02] I guess, rewinding a little, ... when you say covered by the automation, does that include a pre-staged set of commits to be merged?
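A rough sketch of what the "enable and run puppet on Role::Cache::Text hosts" step above could look like via cumin; the PuppetDB query syntax and the bare puppet agent commands are assumptions for illustration, not the actual switchdc code:

    # Hypothetical cumin invocations; the 'R:Class = ...' query syntax is an assumption.
    cumin 'R:Class = Role::Cache::Text' 'puppet agent --enable'
    # 'puppet agent -t' exits 2 on success-with-changes, which a naive exit-code
    # check would count as a failure, so normalize 2 to 0 before cumin sees it.
    cumin 'R:Class = Role::Cache::Text' 'puppet agent -t; rc=$?; [ $rc -eq 0 -o $rc -eq 2 ]'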
[15:31:17] (I think we had such a thing before)
[15:32:00] because that commit link is old, and backwards for the initial switch for MW
[15:32:08] and those instructions are missing the similar commits to move other services
[15:33:20] no, the automation will not merge or puppet-merge anything (on purpose)
[15:33:56] the idea would be to disable puppet where needed (seems only for cache-text as of now) beforehand
[15:34:10] manually merge the change, and then as part of the automation puppet will be enabled and force-run
[15:34:13] ok, so we're still going to have a sort of human runbook of pre-staged commits to rebase->merge during the switch, and it will also list steps like the salt commands in the old list, but now with cumin?
[15:34:14] to apply the change
[15:34:45] right now we still need 1 commit in puppet (this one) and 3 in mediawiki-config, but on 3 different files
[15:34:49] we didn't disable puppet in the old version of this
[15:35:07] one of the pain-points with that is, of course, ensuring it actually applied the new change
[15:35:23] but just disabling ahead of time in a fast sequence of steps doesn't really ensure that either, given variable agent timing
[15:35:24] I can run any command you want to ensure it
[15:35:52] so this loops back to the whole "run-no-puppet echo" pattern, which is clearly not an ideal solution
[15:36:20] what we often want day-to-day, and especially in these steps, is really a way to make the system execute what we *mean*, which is:
[15:36:49] why all this? we will merge the change after having disabled puppet
[15:36:55] "merge puppet commit, ensure it has successfully applied to all typeX nodes"
[15:37:08] the reason why I suggested disabling puppet and merging beforehand is to avoid any manual step during the RO period
[15:37:18] including gerrit rebase and CI and all of this
[15:37:23] depending on the speed we work at and agent timing, there are races in most known ways to do this
[15:38:05] "agent disable; merge commit; enable+run puppet" -> leaves a potential race where the agent was already running before the disable, and is still running when you enable+run
[15:38:15] sure, but this will not be the case
[15:38:26] the puppet disable, as of now, is not even part of the automation
[15:38:33] I can add it to be run 10 minutes before
[15:38:34] and/or
[15:38:43] run it and ensure that no puppet run was running
[15:38:46] but the merge will be manual
[15:38:49] so slow
[15:39:26] so, we also need to know that the agent ran successfully
[15:39:36] (without scrolling through outputs looking for red text or whatever)
[15:39:43] sure, from the exit code
[15:40:11] switchdc will not run --test, but the equivalent without the puppet option for the expanded exit codes
[15:40:33] we also have a lot of other competing and relevant commits in various steps/areas of this procedure
[15:40:36] and I'm happy to add any additional command that can ensure/verify all is good
[15:40:44] no, they all went away
[15:40:48] with the discovery
[15:40:51] at least in puppet
[15:40:59] I'm not sure it's general-case applicable to say "we'll disable puppet several minutes before procedure start, then enable+run as we go"
[15:41:14] because several different parts will require merge->run-puppet on the same sets of nodes
[15:41:31] and putting 10-minute waits in every time we re-disable puppet is bad for the overall timeline
[15:42:06] I think that in the general case we shouldn't have anything puppet-dependent for this kind of thing
[15:42:11] it's not configuration, it's live state
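A minimal sketch of the "disable puppet, then make sure no run is still in flight" idea discussed above, closing the race where an agent run that started before the disable is still applying the old catalog; the lockfile path is an assumption (Puppet 3.x defaults on Debian):

    # Disable the agent, then wait out any run that was already in progress.
    puppet agent --disable "switchdc: pre-merge freeze"
    # Assumed lockfile location for a running agent (Puppet 3.x on Debian);
    # adjust for the actual puppet version/layout.
    while [ -e /var/lib/puppet/state/agent_catalog_run.lock ]; do
        sleep 5
    done
    echo "no agent run in progress; safe to merge"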
[15:42:21] well, yes, but we're clearly not there yet
[15:42:39] if we were there, this would all just be a series of confctl commands
[15:43:17] they mostly are, see T160178
[15:43:17] T160178: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178
[15:43:33] except the cache-related things, which never have been
[15:44:29] the complexity there is deep, I don't know that we'll ever have it fully confctl-driven before we move away from VCL entirely
[15:44:57] but regardless, we knew going into this that the cache things were staying commit-driven. it's just a question of whether it's the same types of commits we did a year ago, or new ones.
[15:45:16] eheheh
[15:45:43] ?
[15:46:04] the important thing is to have the right commit ;)
[15:46:26] anyways, back to "pain points" (which were there last year too, and I think with more automation around them, the pain will increase)
[15:47:34] even on the same set of nodes (e.g. all cache::text), during the DC switchover process, we'll need to roll out multiple puppet commits (merge -> agent run), and at each step we need to ensure that the agent actually applied the newly-merged stuff successfully. Last year a lot of that was manually staring at salt outputs, etc.
[15:48:21] in general, we don't have a good workflow around the process of quickly and reliably doing "merge change X; ensure puppet successfully applied X to all nodes in set Y"
[15:48:59] on this I agree, but where are those multiple merges in https://wikitech.wikimedia.org/wiki/Switch_Datacenter ?
[15:50:45] they're not there because those steps look incomplete. it only really goes into detail for MediaWiki
[15:50:59] well, and thumbs from the cache perspective, but I'm not sure we're doing it quite the same way this year
[15:51:25] there would be similar merge->run of puppet changes for all the other services behind cache_text (e.g. restbase, citoid, cxserver)
[15:52:58] regardless, even with what's documented in detail on that page, you have separate commit pushes for the mediawiki stuff on cache::text, and also the cache::route_table stuff for cache::text
[15:53:53] anyways, I think the merge->run thing is tractable. we have ways to do it now, we should just document whatever way we're going to do it
[15:54:19] aren't the other traffic changes separate from the mediawiki switch? In the sense that they happened after the MW switchover?
[15:55:21] outside of the RO period
[15:55:39] yes, but they're still part of the overall timeline at switch time
[15:55:52] surely we're not going to re-disable puppet and wait 10 minutes between each part
[15:56:21] so this is one way, stolen from some recent shell-level work on the same problem elsewhere:
[15:56:21] sure
[15:56:27] while :; do puppet agent -t; rc=$?; if [ $rc -eq 0 -o $rc -eq 2 ]; then break; elif [ $rc -eq 1 ]; then sleep 1; else echo "Puppet agent failed, aborting" >&2; exit 1; fi; done
[15:56:53] ^ given a starting point of assuming puppet isn't disabled, if you merge->puppet-merge then run the above on a node, it will reliably apply the change
[15:57:16] (or at least, fail in an obvious way if the agent run breaks because of random master connectivity problems or whatever)
[15:58:10] ok
[15:58:42] the sleep case is for "agent already running" (to wait it out, since it could be applying from before the puppet-merge)
[15:58:59] and the break case is both types of success (noop, or changes made)
[15:59:21] 1: The run failed, or wasn't attempted due to another run already in progress.
[15:59:33] from the puppet agent man page, so 1 can also be a failure
[15:59:35] nice
[15:59:59] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3134380 (10daniel)
[16:10:54] bblack: good morning! I am in a meeting and will escape, but I polished the rspec/puppet patch to generate gdnsd config files https://gerrit.wikimedia.org/r/#/c/343747/ :D
[16:11:05] it is far from being 100% polished and fails to find /etc/geo-maps
[16:11:12] but the basics are there, I believe
[16:11:24] hashar: thanks :)
[16:12:41] ah, and I've got to rebase it, bah :(
[16:15:07] and it fails for a random reason. Sorry, gotta polish a few more things :(
[16:15:16] hashar: once I get a chance to look at it, I'll rebase up and then maybe integrate it into the main change itself
[16:15:25] sure thing!
[16:15:47] I made it as a different change so I can spam gerrit as needed and not interfere with the serious work
[16:21:04] hashar: honestly, the part you're working on is the most-serious part. without that, we haven't solved the central issue, which is our inability to CI complex authdns stuff completely in our new world :)
[16:21:28] ahhh
[16:21:33] yeah, that is a point of view :-D
[16:21:52] what looks like serious stuff to me is coming up with all those magical ways to define the dns config
[16:22:01] the stuff that I'm working on in the pre-commit is just to get it all in one repo, because your task would be almost impossible if it meant "you have to CI together matching unmerged changes from two different repos"
[16:22:30] hmm
[16:22:32] we can do that
[16:22:37] using Depends-On in the commit message
[16:22:47] eh, still, it's complicated to do that
[16:22:53] so you get puppet + change A and dns + change B
[16:23:05] and we can make the jenkins job run against A + B
[16:23:28] (and then depends-on still doesn't really encode it for other purposes the way a straight-up git descendant does)
[16:23:29] but one still has to manually fill in Depends-On: in the commit message
[16:23:36] yeah
[16:23:47] that works for mediawiki because nobody merges directly
[16:23:58] and zuul will refuse to process a patch that depends on a change that is still open
[16:24:20] so if you +2 A and B is open or did not get a +2, Zuul/CI refuses to process A entirely
[16:24:45] as I understand it, your aim is to get everything in puppet, and that is probably easier to understand this way
[16:24:55] will see. Will try to sprint a bit of it again tomorrow evening
[17:02:10] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3134622 (10GWicke) The description of the requirements seems to fit the REST API: - [API versioning](https://www.mediawik...
[17:02:12] 10Traffic, 10Monitoring, 06Operations: Performance impact evaluation of enabling nginx-lua and nginx-lua-prometheus on tlsproxy - https://phabricator.wikimedia.org/T161101#3134623 (10ema) nginx-lua-prometheus uses a dictionary in [[ https://github.com/openresty/lua-nginx-module#lua_shared_dict | shared memo...
[17:07:46] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3134637 (10daniel) @gwicke REST URLs and canonical URIs are quite different conceptually, though it's nice when they coinc...
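Picking up the exit-code caveat from 15:59 above: since exit code 1 can mean either "another run in progress" or a hard failure, a bounded-retry variant of the 15:56 loop avoids spinning forever on a genuinely broken run. A sketch, not the actual switchdc implementation; the retry cap and sleep interval are arbitrary:

    # 0/2 are the two success cases; 1 is ambiguous, so cap how long we wait on it.
    tries=0
    while :; do
        puppet agent -t; rc=$?
        if [ $rc -eq 0 -o $rc -eq 2 ]; then
            break                     # success: noop, or changes applied
        elif [ $rc -eq 1 ] && [ $tries -lt 30 ]; then
            tries=$((tries + 1))
            sleep 10                  # possibly another run in progress; wait it out
        else
            echo "puppet agent failed (rc=$rc), aborting" >&2
            exit 1
        fi
    done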
[17:10:48] bblack: if you agree, I'd like to split the topic in 2: what needs to be done during the MW switchover (in particular in the RO period) on one side, and the traffic part on the other
[17:11:37] ok
[17:12:13] what's required, traffic-related, for the MW switchover? something like last year's commit, or something different?
[17:15:50] assuming we don't merge the big active/active change for the traffic-layer before apr 19, something like last year's commits
[17:16:10] ok, and if we instead merge the active/active?
[17:16:19] how much does it change?
[17:17:30] well, before answering that, can we step back and integrate the MW/RO stuff and the traffic stuff? it becomes relevant if we start asking that question (really, it's relevant anyways)
[17:18:04] sure, what do you mean by integrate?
[17:18:22] from the MW/RO sub-section only, what we're talking about is just switching "which DC's MediaWiki is the traffic layer talking to?"
[17:18:48] whereas over in the traffic-switching section, there's a couple of related sub-topics:
[17:19:06] 1) Which DCs (plural) are end-users contacting the edge of, and:
[17:19:23] 2) Which DC (singular) is handling the traffic<->appserver interface for MediaWiki
[17:19:59] all 3 of these are independent decisions. for the whole switchover, we're trying to simulate "no eqiad", so that means 3 things:
[17:20:08] 1) No users contacting the eqiad front edge directly
[17:20:24] 2) Eqiad backend caches not involved in the interface from traffic<->appserver
[17:20:35] 3) The MW/RO part: that the appservers traffic contacts are not in eqiad
[17:21:33] If we just do (3) without changing (2), we end up in a PII-leak situation due to lack of TLS
[17:21:52] sure
[17:21:56] traffic would still be routing to/through the eqiad caches, and the eqiad caches would be contacting appservers.svc.codfw.wmnet directly without TLS
[17:23:43] so, I think a year ago, we stepped through the MW/RO part first, which created the PII leak, and then we fixed up traffic routing (2 above) afterwards to fix PII
[17:23:53] I think (1) we actually did ahead of the change (maybe 24h or more)
[17:23:58] also, if we do 1, do we plan to stay with it for the whole period of the switchover?
[17:24:06] yes
[17:24:26] 1+2+3 would all be in effect for the whole switchover period
[17:25:22] ok
[17:25:32] so here's the other nugget of context to understand:
[17:25:50] 1+2 above happen on a per-traffic-cluster level, whereas 3 happens on a per-application-service level
[17:26:35] we can switch MW, API, Restbase, etc between DCs individually for (3), but the routing of users through eqiad for (1+2) happens for all of cache_text together (all those services)
[17:26:57] meaning that 3 is MW & friends, while the clustering of caches is different
[17:28:44] well, (3) applies separately to all applayer services everywhere from the traffic POV, but the "MW/RO" steps are the one complex example
[17:30:01] anyways, with all of that random context as input: if we were to merge the big traffic active/active change before Apr 19th, it fundamentally changes the above.
[17:30:30] in that case, the inter-cache routing and traffic<->appserver handoff point are no longer global to a whole traffic cluster like cache::text; they become per-application-service as well
[17:31:34] so in that world, the whole perspective is fundamentally different
[17:31:49] we still have (1) deciding to shut users out of the eqiad front edge (probably a day before the switch and keeping it throughout)
[17:32:34] the inter-cache routing table (which decides e.g. that esams caches ask eqiad caches for answers) is still per-traffic-cluster
[17:33:15] and that's something we can change independently at the start of the whole procedure if we want (to ensure esams doesn't indirectly pass through eqiad to get to places)
[17:33:39] but all the traffic<->applayer stuff is per-applayer-service (both where the interface happens and which side it contacts)
[17:34:12] for an active/active service like RB, the "normal" state would be that codfw caches talk to codfw applayer, and eqiad caches talk to eqiad applayer.
[17:34:42] when we want to disable eqiad for this switchover, it's a single commit to comment out the eqiad applayer for RB and push that around, and we're done (no multiple steps here)
[17:35:12] ditto for revert, single commit to restore. In both cases there's no PII leak and nothing else to worry about critically.
[17:35:55] right. So I see 2 clear options: 1) agree on a deadline by which the decision of merging / not merging the active/active has to be made. 2) implement the automation steps for both cases, allowing for more time to land the active/active change
[17:35:57] for a service with an RO period, we have multiple options under that scheme, but the simplest would be, from the traffic perspective, to temporarily go active/active during the RO period
[17:37:03] so basically it's stepwise something like: (1) Make MW RO (2) 1x commit pushed to traffic nodes, to make MW Active/Active (3) Second commit pushed to traffic nodes, dropping back to just codfw active (4) MW goes RW
[17:37:35] the two separate commits are to ensure we don't cause routing loops by making a sudden switch from eqiad-only to codfw-only in a single step for this case
[17:38:14] ok
[17:39:02] (it's cleaner I guess. although the routing loops would just generate quick 503s anyways)
[17:40:27] on your clear options above: I'd rather go with (1), and make it this week
[17:40:47] there's no point doing the extra work in parallel, or trying to merge a complex change just before the switch deadline
[17:41:33] I see the double work as advance work for the next switchover, where I guess we'll have it anyway (but maybe we'll have something even better, who knows)
[17:41:44] apart from that, yes, 1 is probably better
[17:41:46] assume no-change on the traffic front for now. deadline end of week to make the switch to the new way. deciding to merge up the new way implies commitment to updating the procedure quickly too.
[17:42:01] sounds like a plan!
[17:42:02] :)
[17:44:02] do we have the actual procedure we ran down last time somewhere?
[17:44:12] (was it that wiki page, but now it's been edited?)
[17:44:29] we had a whole global list of steps and sub-steps laid out sequentially
[17:44:55] yes, it's that wiki page; it was then edited for the switchback
[17:45:14] there is a separate section for traffic at the bottom
[17:45:16] the perils of wikis! I guess I can go grab it from the edit history
[17:45:59] looking for something in particular?
[17:46:29] no, I just wanted to run over that for context a bit in my head, on proposed cmdline changes and such
[18:12:02] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3134885 (10GWicke) > However, URIs by nature should not include interface version information, because they identify the r...
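A sketch of the two-commit sequence at 17:37 (eqiad-only -> active/active -> codfw-only), with a verification run between the steps so the second commit only goes out once the first has applied everywhere; the cumin query and the use of puppet-merge here are assumptions:

    # Step 1: merge the commit adding codfw alongside eqiad (active/active),
    # then verify it actually applied on all cache_text hosts before proceeding.
    puppet-merge
    cumin 'R:Class = Role::Cache::Text' 'puppet agent -t; rc=$?; [ $rc -eq 0 -o $rc -eq 2 ]'
    # Step 2: only now merge the commit dropping eqiad (codfw-only); doing both
    # sides in one step risks a brief window of inter-cache routing loops.
    puppet-merge
    cumin 'R:Class = Role::Cache::Text' 'puppet agent -t; rc=$?; [ $rc -eq 0 -o $rc -eq 2 ]'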
[18:18:07] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved), 06Services (watching): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3134911 (10GWicke)
[18:31:08] volans: in the long run, I wonder whether per-service internal x-dc discovery is going to be a useful thing anyways. it's kind of a good step for now, but it has potential issues because of latency multiplication anyways.
[18:31:58] volans: e.g. imagine a world in which all the services are active/active. the normal state of affairs (via dns discovery) is that for service<->service traffic, everything is DC-local.
[18:33:06] volans: there's so much room for latency-multiplier problems if we, for example, depooled just the eqiad side of some_service to do maintenance on it or whatever.
[18:33:55] volans: (e.g. MW, while processing a single user query, makes 34 separate network calls to $some_service. when we switch just $some_service to codfw-only, all the traffic coming through MW-eqiad that has that call pattern now gets unusably latent)
[18:35:04] because under normal conditions it was 34x0ms, now it's 34x36ms = 1.2s to complete those subqueries
[18:35:52] since we wouldn't observe the problem under normal conditions, it seems like even if we tried to avoid those sorts of patterns, they'll creep into ever-evolving code and only get noticed when we outage 1x server at 1x dc for "maintenance" or whatever
[18:36:59] thinking about it from that angle, it seems saner to have all inter-service traffic always be DC-local, and not consider it acceptable to down a service for "maintenance" at 1x DC.
[18:37:15] at which point we don't really need inter-service dns discovery
[18:37:40] we can do whole-DC tests and maintenance without it
[19:07:26] * volans back
[19:11:13] bblack: yeah, it's full of these underlying dependency-related issues
[19:19:08] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved), 06Services (watching): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3135095 (10Smalyshev) I think we have several concepts there that needs to be refined. 1. Canoni...
[20:03:50] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved), 06Services (watching): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3135185 (10daniel) @GWicke: * I'd rather not use a REST URL as the URI. A good REST API exposes v...
[20:05:34] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved), 06Services (watching): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3135190 (10daniel)
[20:40:03] bblack: did another quick pass on the dns lint change https://gerrit.wikimedia.org/r/#/c/343747/5
[20:40:22] now it fails to load /usr/share/GeoIP/GeoIP2-City.mmdb, which I guess is the maxmind private/licensed db
[20:41:23] though somehow it is on the old jessie slaves ..
[20:41:25] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved), 06Services (watching): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3135290 (10Smalyshev) > for commons data, we may not need a canonical object (concept) URI. Wel...
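For the "depooled just the eqiad side of some_service" scenario above, the etcd/discovery entry change mentioned earlier in the day might look roughly like this; the conftool object type, dnsdisc name, and pooled field are all assumptions for illustration, not the production schema:

    # Hypothetical conftool commands flipping a discovery record to codfw-only.
    confctl --object-type discovery select 'dnsdisc=some_service,name=codfw' set/pooled=true
    confctl --object-type discovery select 'dnsdisc=some_service,name=eqiad' set/pooled=false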
[20:44:13] that is all for tonight *wave*
[20:47:12] hashar: the old jessie slaves have "class authdns::lint" installed on the CI host itself, which installs some baseline things like the gdnsd package and the GeoIP database (which are considered more CI-infra than evolving things to test)
[21:12:45] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved), 06Services (watching): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3135370 (10daniel) >>! In T161527#3135290, @Smalyshev wrote: > Well, if we plan to refer to it in...
[21:37:40] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved), 06Services (watching): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3135451 (10Smalyshev) > . In Wikidata, each entity has two URIs associated with it: the concept U...
[22:39:21] 10Wikimedia-Apache-configuration, 10Wikidata, 10ArchCom-RfC (ArchCom-Approved), 06Services (watching): Canonical data URIs and URLs for machine readable page content - https://phabricator.wikimedia.org/T161527#3135562 (10daniel) @Smalyshev Oh, I was just trying to clarify the semantics of the URI. I wasn't...
[23:45:19] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 2 others: Redo /beacon/impression system (formerly Special:RecordImpression) to remove extra round trips on all FR impressions (title was: S:RI should pyroperish) - https://phabricator.wikimedia.org/T45250#3135775 (10K...