[00:00:57] 10Traffic, 10Maps, 10Maps-Sprint, 10Operations, and 2 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3969931 (10jmatazzoni)
[00:01:10] 10Traffic, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3969935 (10jmatazzoni)
[08:52:11] 10HTTPS, 10Traffic, 10Discovery, 10Operations, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3971009 (10Smalyshev) p:05Low>03Lowest
[08:52:32] 10HTTPS, 10Traffic, 10Discovery, 10Operations, and 2 others: compile number of http uses for http://www.wikidata.org/entity - https://phabricator.wikimedia.org/T154017#3971010 (10Smalyshev) 05Open>03stalled p:05Low>03Lowest
[08:52:36] 10HTTPS, 10Traffic, 10Discovery, 10Operations, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#3971012 (10Smalyshev)
[11:00:03] 10Traffic, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3971323 (10ema) Anything else left to be discussed here?
[11:23:21] 10Traffic, 10Discovery, 10Maps, 10Maps-Sprint, and 3 others: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3971360 (10Pnorman) Because cache-control is in one of the sample config files, we should make sure it's something sensible, even if we use something differ...
[11:29:50] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3971381 (10MoritzMuehlenhoff) Valentín has been added to pwstore.
[11:30:12] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3971382 (10MoritzMuehlenhoff)
[11:47:33] so, pybal's etcd.py
[11:47:47] I'm trying to figure out how it works (bad idea, I know)
[11:48:40] getMaxModifiedIndex is what puzzles me at the moment
[11:48:43] https://github.com/wikimedia/PyBal/blob/master/pybal/etcd.py#L151
[11:49:15] I don't think we've got "nodes" in our etcd responses
[11:52:05] if 'nodes' is not there the for loop doesn't do anything, so I think you're right
[11:53:49] also, my understanding is that you'd wait for a given index to make sure you don't miss any updates, but waiting for response['modifiedIndex'] + 1 seems pointless?
[11:59:04] oh, but modifiedIndex as returned from etcd is not the "last index" but rather the index of whatever update you were waiting for
[12:00:51] ok I guess what I don't understand is: why exactly are we using waitIndex? To protect against which type of failures?
[12:03:54] surely not network issues given that we reset the index to None in case of connection failed/lost
[12:07:03] hmmm in that case we shouldn't begin with X-Etcd-Index?
[12:07:14] (https://coreos.com/etcd/docs/latest/v2/api.html)
[12:08:43] yeah there is that too I think
[12:09:48] more in general though, in our use case we care about the current state of the data in etcd, not whatever happened in the past
[12:10:08] if etcd says a host is pooled, *now*, then it needs to be pooled, now
[12:10:24] regardless of whether we missed a depool event in the past or not
[12:11:23] it would actually be wrong to depool the host/service now because a few updates ago (which we've missed) it was depooled!
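A minimal sketch of the getMaxModifiedIndex point above, assuming the linked function walks a 'nodes' list roughly as the conversation suggests — this is an illustrative paraphrase, not the actual pybal/etcd.py code:

```python
# Illustrative paraphrase of the behaviour discussed above; not the real
# pybal/etcd.py. In the etcd v2 JSON API a directory node carries a 'nodes'
# list of children, while a plain key does not.
def get_max_modified_index(node, max_index=0):
    for child in node.get('nodes', []):
        max_index = max(max_index, child.get('modifiedIndex', 0))
        max_index = get_max_modified_index(child, max_index)  # dirs can nest
    return max_index

# A single-key response node, roughly the shape of what a pybal watch gets
# back (key and value here are made up for illustration):
node = {'key': '/conftool/v1/pools/...', 'value': '{"pooled": "yes"}', 'modifiedIndex': 1234}

# Without a 'nodes' list the loop body never runs, so the result stays at 0 --
# which is the "if 'nodes' is not there the for loop doesn't do anything" point.
print(get_max_modified_index(node))  # -> 0
```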
[12:11:36] yeah, we don't have to reply all the actions if we missed some
[12:11:43] *replay
[12:12:04] not only we don't have to, we shouldn't :)
[12:12:39] yeah read mine as we must not :D
[12:25:04] so well then the whole thing is just about passing `wait=true` to all requests except for the first one
[12:26:02] without any waitIndexes, etcdIndexes, and the like
[13:28:16] hi :)
[13:28:37] it's always such a bad idea to actually look at the code :)
[13:28:44] LOL
[13:29:00] vgutierrez: how goes your onboarding? :)
[13:29:11] snail pace :(
[13:29:15] figured
[13:30:30] https://phabricator.wikimedia.org/T187035
[13:31:07] XioNoX: can help you with network device access off that list, just bug him to add your ssh key everywhere :)
[13:31:30] I'm not sure about racktables, I suspect it's some traditional password hashes stored in our private repo?
[13:31:44] and then there's a bunch of email stuff there
[13:31:51] and... icinga
[13:32:15] yup.. I was waiting till PDT office hours to bother mutante
[13:32:27] yeah you learn to hate timezones here :)
[13:32:50] the admin credentials for racktables are in pwstore, I think you can also add yourself now
[13:32:59] ah ok
[13:33:02] lovely :)
[13:34:06] on the functional side, we've gotta get some efficient brain-dumping going from me->you on basically the history of and where we stand now on all things TLS-related, and known near-to-medium-term upcoming bits.
[13:34:28] it will take time, but we gotta start somewhere :)
[13:34:50] wow.. I just realised I worked with one of pws co-authors for two years
[13:34:57] bblack: sure :D
[13:35:31] also, another missing piece of the puzzle is telling the rest of the org that you're here, outside of a few meetings it's been mentioned in and IRC
[13:35:51] no more ninja mode? :(
[13:36:18] we usually do a new employee announcement email that has a certain canonical form, I'll get with you on the details shortly in private, I need to resync my coffee mug with its dispenser and take care of a few other odds and ends first.
[13:36:45] sure, nothing properly works without caffeine
[13:39:20] bblack: no, racktables is its own special snowflake, but I can take care of it if you want (anyone can really)
[13:40:18] volans: let me do it :)
[13:40:29] according to moritzm I should be able to do it myself
[13:40:46] vgutierrez: let me correct, anyone that already has an account ;)
[13:41:07] hmm I can abuse the admin account :)
[13:41:11] to create mine
[13:41:28] didn't know we had a global admin one :D
[13:41:46] ok then I can guide you if you want, the UI is not very friendly
[13:42:30] not sure if it's documented anywhere
[13:42:35] if not we should add it
[13:42:48] although racktables should be decommed $soon
[13:42:58] ema: btw where are we at on v5, still pending upload@eqiad? we should start text in esams soon, as it may change/impact whatever mystery with the 160s thing as well.
[13:43:48] right, we have netbox as a pending thing that will eclipse racktables, but not quite there yet
[13:44:39] ema: my other queued Q for the morning: do you know what the units are in the new objects purge rate graphs? objs/sec?
[13:45:07] ema: or is it actually reporting the fraction of objs_purged/purges_recvd ?
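Putting the etcd conclusion above into code — first request plain, every later request with wait=true, and no waitIndex bookkeeping at all — a minimal blocking sketch might look like the following. pybal's real client is Twisted-based; the requests library, endpoint, and key path here are only stand-ins:

```python
# Hypothetical sketch of the "wait=true only" idea discussed above, not pybal
# code. Endpoint and key path are made up for illustration.
import requests

ETCD = 'https://conf1001.example.org:2379'                      # assumed endpoint
KEY = '/v2/keys/conftool/v1/pools/eqiad/cache_text/varnish-fe'  # assumed key

def watch_current_state(handle_state):
    # First request: no wait, just fetch the current state of the key.
    resp = requests.get(ETCD + KEY, params={'recursive': 'true'}).json()
    handle_state(resp['node'])
    while True:
        # All later requests: long-poll with wait=true, deliberately without a
        # waitIndex -- if intermediate updates were missed we don't replay
        # them, we only act on whatever the state is *now*.
        resp = requests.get(
            ETCD + KEY,
            params={'recursive': 'true', 'wait': 'true'},
            timeout=None,
        ).json()
        handle_state(resp['node'])
```

(One caveat with this naive version: an update that lands between two long-polls only gets noticed at the next update after it.)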
[13:46:05] <3 for insisting on units in graphs
[13:47:17] I think now that I've scrolled around some graphs, it kinda has to be the latter or the data wouldn't make sense
[13:56:17] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3971759 (10Vgutierrez)
[13:56:34] racktables sorted out :)
[14:28:36] 10Traffic, 10Operations, 10Patch-For-Review: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3971859 (10BBlack) Remaining known stuff, paring down the earlier list: ``` * hieradata/common/cache/*.yaml: cp5006 + cp5010 commented out (borked) * External monitoring stuff in:...
[14:30:36] bblack: is prometheus in eqsin in scope for ^ or another task ?
[14:31:01] godog: yeah, probably should be :)
[14:31:19] godog: bast5001 has the roles for it and the basic hieradata is there already I think, but I donno what other setup...
[14:31:43] hieradata/eqsin.yaml has prometheus_nodes already
[14:32:32] bblack: oh ok, then it should be already DTRT, I'll double check
[14:32:57] is there somewhere else we need to configure bast5001 to integrate it with global stats and grafana and such?
[14:33:47] adding it as a datasource in grafana, the rest should be already in place
[14:34:14] sadly we don't manage datasources in puppet for grafana, so it is a point and click affair
[14:34:50] bblack: hey! So, the metric plotted there is
[14:34:52] MAIN.n_obj_purged 1 . Number of purged objects
[14:35:08] and we plot the usual prometheus rate[5m]
[14:36:05] which would be the per-second rate measured over the last 5 minutes
[14:37:00] so no, it is not the fraction of purged objects over purge received but rather the number of purged objects/sec
[14:37:27] it seems to have a value that looks like that, though :)
[14:37:30] it's odd
[14:37:36] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3971876 (10Vgutierrez)
[14:37:47] by that I mean, the number seems to hover in the 0.N ranges and never reach 1.0
[14:38:02] ok
[14:39:09] nice, I see eqsin popping up in varnish-machine-stats :)
[14:39:20] hehe yeah I added the datasource just now
[14:39:33] everything else seems to be in place already, big win
[14:42:10] bblack: the 0-1.0 thing seems just a coincidence, see eg: cp3040
[14:42:13] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=67&fullscreen&orgId=1&var-server=cp3040&var-datasource=esams%20prometheus%2Fops
[14:42:52] ah
[14:45:19] definitely feeling the latency reaching eqsin from esams -> eqiad -> codfw
[14:45:44] does it affect grafana graph views too?
[14:45:57] (does the data actually pull from bast5001 when rendering?)
[14:46:27] yeah it does pull from bast5001 for site-local data
[14:46:43] re: ssh access, probably best to add to .ssh/config and use bast5001 to reach *.eqsin.wmnet (and not use bast3 to reach bast5)
[14:46:57] but we'll add that to wikitech at some stage of readiness
[14:47:17] bblack: oh, and re:upload@eqiad, starting now!
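For reference, the panel discussed above plots something along these lines. The exporter metric names are an assumption on my part; the underlying VSC counter is MAIN.n_obj_purged as quoted:

```
# Per-second rate of purged objects over the last 5 minutes (what the graph shows):
rate(varnish_main_n_obj_purged[5m])

# What it would look like if it were instead the fraction of purged objects per
# received purge (the other reading floated above):
rate(varnish_main_n_obj_purged[5m]) / rate(varnish_main_n_purges[5m])
```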
:)
[14:48:30] heh I don't know, I'm usually happier to use our backbone instead of the internet but of course in case of trouble that'd be correct
[14:49:27] that's probably generally true, although I don't know if it will hold in this case
[14:49:48] bblack: I'm using iron (yubikey) from Italy and it's still acceptable, nothing compared to when I was ssh-ing into Sydney from Dublin in a past life ;)
[14:50:18] can someone try from your random place in the EU, direct ping to bast5001, vs ping time to bast3002?
[14:51:17] I'm getting ~270ms to bast5001 and ~40 to bast3002
[14:51:26] 255.194/309.333/377.081/50.682 ms VS 33.973/34.253/34.694/0.315 ms
[14:51:26] it's about 330ms inside our network from bast3->bast5, so the question is whether bast5 direct is 330ms worse over the internet than bast3
[14:51:54] 5001 272.726/345.953/538.723/82.279 ms | 3001 36.032/37.274/39.650/1.063 ms
[14:51:59] right, so for your case, if you put bast5 in ssh config it will cut you from 370ms to 270ms
[14:52:24] 5001: 268ms - 3002: 33ms
[14:52:55] yeah they all seem (to varying degrees) to support the idea that you'll reach eqsin faster going over the internet to it than over our network from EU
[14:53:46] do the traces run westwards like our network, or go west over asia?
[14:53:53] err, "east over asia"
[14:54:26] in-network bast3<->bast5 is: 330.935/331.015/331.324/0.730 ms
[14:54:45] good question, I get a 220ms jump between mei-b2-link.telia.net and snge-b1-link.telia.net
[14:55:41] mississippi?
[14:55:42] me too
[14:56:35] yeah mei must be MS
[14:57:04] I guess it's just a faster jump, if telia is hopping you as EU->MSUS->SG
[14:57:10] interesting, from a VPS I have access to in Italy, it's still telia, same path, but much less latency
[14:57:18] vs our esams->eqiad->codfw->sg links adding up
[14:57:24] bast5001 163.659/163.692/163.744/0.405 ms
[14:57:43] still mei-b2-link.telia.net 14ms -> snge-b1-link.telia.net 176 ms
[14:58:43] QoS?
[15:02:16] no idea
[15:12:15] grrr
[15:12:24] ! [remote rejected] HEAD -> refs/for/production/varnish5-upgrade (the number of pushed changes in a batch exceeds the max limit 10)
[15:12:32] guess how many I was trying to push?
[15:12:59] 11
[15:13:12] correct!
[15:13:24] murphy's ftw
[15:14:07] but you're cheating, one patch per host... it's just a trick to be the puppet committer of the month :D
[15:15:26] it's just a trick to lighten up my github profile really
[15:22:23] you could probably automate around it with a push hook
[15:23:11] (a push hook that detects >10 commits, then does multiple separate pushes in chunks of 10) :)
[15:25:47] ema: btw watch for possible collision of upload@eqiad upgrade restarts with godog rolling thumbor upgrade depools. possible we create a load issue between the two?
[15:26:11] although I guess swift should mediate it a bit
[15:27:05] oh yeah I hadn't noticed godog's upgrades
[15:27:13] godog: let me know when you're done!
[15:27:22] good point, I'm going codfw first
[15:27:59] but yeah I don't think varnish notices, better safe than sorry heh
[15:32:36] ema: eqiad now, ETA 10m
[15:33:52] godog: ok, go ahead!
[15:40:37] ema: {{done}}
[15:41:36] godog: great, FYI we've had a pretty steep increase in requests to swift after the upgrade of cp1048, it seems to be going back to normal now as expected
[15:42:43] ema: ack, yeah looks like about 1k/s
[16:14:09] godog: ms-be are busy machines!
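Given those latency numbers, the .ssh/config suggestion from earlier would boil down to something like the stanza below. The hostname pattern and bastion FQDN are assumptions based on the log; the point is just "jump straight through bast5001 for eqsin hosts instead of chaining bastions over the backbone":

```
# Hypothetical ~/.ssh/config snippet -- host naming is assumed from the log.
Host *.eqsin.wmnet
    ProxyJump bast5001.wikimedia.org
```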
:)
[16:14:09] yeah no kidding, lotsa spindles
[16:14:10] interesting how top3 load now on ms-be is roughly the same as 24h ago, while it has increased on ms-fe a bit
[16:15:10] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3972241 (10RobH) >>! In T187035#3969514, @Dzahn wrote: > @Robh could you do one more Racktables user? thanks! someone beat me to this, he is already setup! =]
[16:16:23] 10Traffic, 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3972246 (10Vgutierrez) >>! In T187035#3972241, @RobH wrote: >>>! In T187035#3969514, @Dzahn wrote: >> @Robh could you do one more Racktables user? thanks! >...
[16:32:33] _joe_: so, earlier today I was looking into pybal's etcd module. Am I right in saying that waitIndex is useful to get changes that have happened in the past and that might have been missed by the etcd client? If so, why are we using that feature in pybal? We should only care about the latest values in etcd when it comes to pybal, shouldn't we?
[16:33:18] <_joe_> yes
[16:33:26] <_joe_> ema: no
[16:33:48] <_joe_> not necessarily, it's not that simple
[16:33:54] <_joe_> but I'm in a meeting sorry
[16:34:51] _joe_: ok, let's discuss this when you have time :)
[16:35:38] <_joe_> yeah sorry
[17:00:24] volans: cumin committed seppuku on me https://phabricator.wikimedia.org/P6699
[17:00:41] ema: checking
[17:00:58] thanks
[17:01:01] ema: seems that puppetdb didn't reply
[17:01:19] Active: active (running) since Wed 2018-02-14 16:58:26 UTC; 2min 48s ago
[17:01:31] ema: it's puppetdb that committed seppuku
[17:02:03] [Wed Feb 14 16:58:06 2018] Out of memory: Kill process 27758 (java) score 355 or sacrifice child
[17:02:17] ema: retry now, should work
[17:02:55] great timing puppetdb!
[17:03:27] so you got exactly in those 20s it was down
[17:03:33] way to go ema! :D
[17:04:15] ema: btw I got another thumbor deployment coming up, LMK when I can go
[17:06:44] godog: I've just finished upgrading cp1050 now, go ahead whenever you think that the load on swift is acceptable
[17:07:00] ema: ok!
[17:28:44] ema: almost done with thumbor/swift, one last rolling restart
[17:29:06] nice
[17:42:51] ema: I'm done with rolling restarts
[17:43:00] \o/
[20:49:27] upload@eqiad almost done, just one host to go (cp1099)
[20:50:27] I've kept traffic to swift below 3k req/s when spacing the upgrades, which I think is conservative enough when looking at the impact on ms-be/fe load
[20:51:22] only the very first upgrade (cp1048) caused traffic to go beyond that threshold almost immediately
[20:51:37] without any consequences though :)
[20:53:12] wmf-upgrade-varnish is very useful and reduces lots of toil, but still requires one patch per host toggling the hiera settings, and one puppet-merge per host
[20:53:37] hence I've added --hiera-merged https://gerrit.wikimedia.org/r/c/410558/
[20:54:49] so the idea is that we can disable puppet on a bunch of hosts, perhaps the whole cluster, merge a single puppet change toggling use_experimental and varnish_version, and run the script once per host
[20:55:53] the downside of course is that we'd have to keep puppet disabled during the whole upgrade process, which usually lasts ~1/2 day
[20:59:20] heh I wanted to add volans to the CR but he gets automatically added to anything that has to do with python, even remotely :)
[21:02:04] :-P
[21:03:13] volans: thanks!
:)
[21:03:27] yw
[21:03:30] was a quick one
[21:04:18] ema: the other option could be that --hiera-merged accepts a value that is the message used when puppet was disabled
[21:04:23] might be cleaner, up to you
[21:05:17] oh yeah that's a good idea
[21:06:30] we should probably bail out immediately if the message passed to --hiera-merged and the one used to disable puppet differ
[21:07:03] at that point you just pass it like you do now
[21:07:10] and run-puppet-agent will fail
[21:07:19] to reenable it
[21:08:30] it would, but it's better to avoid depooling the host and taking any other action if that's the case
[21:09:44] right, that check is way later
[21:12:03] I'm gonna make those changes tomorrow and upgrade eqsin with the latest version of the script then :)
[21:12:34] sounds like a plan!
[21:14:25] alright! all upload clusters upgraded, except eqsin :)
[21:15:29] hitrate in eqiad slowly recovering, we're looking good
[21:28:45] you're here way too late in the day! :)
[21:32:42] it's a pleasant distraction from real life :)
[21:32:58] fair enough!
[21:51:44] the encoding rabbithole is deep :P
[21:52:12] (well, especially for restbase)
[21:53:04] did you know that ?action=purge on some pages generates 32 PURGE requests to the caches? :)
[21:53:32] I think I can find other cases that possibly do a lot more
[21:57:36] so apparently (I didn't realize until today, in spite of so many past conversations/tickets about related things)
[21:57:51] RB and MW actually differ substantially in what they consider to be the canonical encoding of a title string
[21:58:14] MW has some well-defined rules that are mostly-documented and fairly easy to follow
[21:58:49] RB's canonical title encoding is apparently whatever happens when you take MW's normalized encoding (or worse, some other random mis-normalized encoding) and pass it through Javascript's stock encodeURIComponent() call
[21:59:30] (which, in spite of being apparently a language-level primitive, doesn't seem to really follow the strict RFC rules on what to encode, either. Maybe it wasn't meant for whole urls or for paths, as indicated by the "component" in the name...
[21:59:34] )
[21:59:57] luckily most reasonable encodings functionally "work" anyways, so our past encoding efforts never broke anything.
[22:00:12] it just leads to duplicated cache contents and possibly stale content (when canonical isn't really canonical)
[22:00:51] then to make matters funner, when RB paths are PURGEd, it sends duplicate purges in MW's canonical form and its own
[22:00:58] e.g. back to back:
[22:00:59] - ReqURL /api/rest_v1/page/mobile-sections-remaining/Skadden%2C_Arps%2C_Slate%2C_Meagher_%26_Flom
[22:01:02] - ReqURL /api/rest_v1/page/mobile-sections-remaining/Skadden,_Arps,_Slate,_Meagher_%26_Flom
[22:01:24] (luckily, it doesn't iterate all possible encoding variants, e.g. only 1-2/3 of the commas encoded and not the others)
[22:02:24] so there's encoding-duplication in the purge count, as well as RB of course having mobile-sections and all that, and past revs.
[22:02:44] the best example I've seen yet, the total purge count for 1x ?action=purge was 40
[22:04:46] I wonder if RB's documentation declares anything about this, i.e. what its canonical encoding form really is.
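To make the duplication above concrete, here is a small Python sketch. quote() is only standing in for the two encoders, and the safe-character sets are illustrative, not MW's or RB's actual rules; the point is that two valid percent-encodings of the same title are distinct cache keys:

```python
# Illustrative only: neither MW nor RB uses Python's quote(), but the effect on
# the cache is the same -- same title, two different keys, two PURGEs needed.
from urllib.parse import quote

title = 'Skadden,_Arps,_Slate,_Meagher_&_Flom'
base = '/api/rest_v1/page/mobile-sections-remaining/'

# "MW-style": comma left literal, '&' percent-encoded.
mw_style = quote(title, safe=",!*'()")
# "encodeURIComponent-style": comma percent-encoded as well.
rb_style = quote(title, safe="!*'()")

print(base + mw_style)       # .../Skadden,_Arps,_Slate,_Meagher_%26_Flom
print(base + rb_style)       # .../Skadden%2C_Arps%2C_Slate%2C_Meagher_%26_Flom
print(mw_style == rb_style)  # False: the cache treats these as two objects
```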
[23:54:33] 10Traffic, 10DC-Ops, 10Operations, 10ops-eqsin: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3974137 (10RobH)
[23:57:44] 10Traffic, 10Operations: Server hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3974182 (10RobH)
[23:57:51] 10Traffic, 10DC-Ops, 10Operations, 10ops-eqsin: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3974185 (10RobH)
[23:57:55] 10Traffic, 10Operations: Network hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T162683#3974186 (10RobH)
[23:57:58] 10Traffic, 10DC-Ops, 10Operations, 10ops-eqsin: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3287561 (10RobH)