[09:02:31] Traffic, Operations, Pybal, Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (ema) >>! In T189290#4053740, @ema wrote: > It would have been much more useful to get such messages into `journalctl -u pybal.service`'s output instead, and I do...
[09:11:05] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) Last changed applied by Arzhel, including merging common-infrastructure4 to analytics-in4
[10:56:14] Traffic, Operations, ops-esams, Patch-For-Review: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607 (MoritzMuehlenhoff)
[10:59:08] Traffic, Operations: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (MoritzMuehlenhoff)
[11:07:44] Traffic, Operations, Wikidata, wikiba.se, Patch-For-Review: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (abian) wikiba.se is a bit unstable. Today has been down for some hours (from ~1:00 UTC to ~5:30 UTC). Last issues were detected on...
[12:38:41] Traffic, Operations, Pybal, Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (Vgutierrez) IMHO it would be great if those messages were in stderr instead of stdout otherwise you miss them when using journalctl + grep, that's why we missed...
[12:53:00] (just noticed we were using ur1.ca as shortener, and it's kinda borked for TLS :)
[12:53:18] tinyurl seems to be at least slightly less unfriendly on general issues, although none of them are great
[12:53:19] bblack: :_( I asked for that a few months ago :P
[12:53:28] I don't remember that, sorry!
[12:54:02] I only noticed at all because my client was offline for several hours and I wanted to check logs heh
[12:54:02] bblack: BTW, I've synced with XioNoX, today at 16:00 UTC we will replace baham with authdns2001
[12:54:10] sounds great :)
[12:54:32] the only thing missing in puppet is this CR: https://gerrit.wikimedia.org/r/444872
[12:55:13] probably should split that commit, IMHO
[12:56:03] add the new one, test syncing works with the new set of 4, then do the router move (if something goes amiss, easy to undo work at the routing level without disrupting anything at the puppet or authdns-update level)
[12:56:11] then pull baham out of the list after success
[12:56:49] makes sense
[13:02:21] 2 cents: might be good to give a voice to all SREs to avoid merging DNS patches during the migration ;)
[13:06:20] volans: mail sent, thx :D
[13:16:12] bblack: syncing'em it's mandatory.. cause authdns2001 has gone outdated/stalled since it has been installed but kept out of /etc/wikimedia-authdns.conf nameservers list
[13:21:44] right
[13:22:03] I forget if they all need a puppet run before the sync works across the 4, or just the one you're running the command on
[13:24:10] well.. If I want to run the sync command from authdns2001, then the 4 of them, cause fw rules need to be updated
[13:24:41] if I run it from one of the others.. then only in the one running it
[13:32:13] will update the routers firewall rules before the window (add the new IP and not remove the old one)
[14:11:01] vgutierrez: XioNoX: ok with moving the weekly meeting up a day to tomorrow?
[14:11:12] (which means start writing etherpad updates now I guess!)
[14:12:40] yep, but wont be able to write the updates before later today
[14:13:13] win 25
[14:16:50] ok
[14:28:14] bblack: I did some cleanup this morning (backends indentation, not defining vtc_backend 1M times).
The diff might now be a bit more pleasant, although still pretty long https://puppet-compiler.wmflabs.org/compiler02/11755/cp1052.eqiad.wmnet/
[14:32:30] bblack: sure not problem :D
[14:55:41] ema: looks nice!
[14:55:53] ema: I had a passing unrelated thought while staring at that...
[14:57:33] ema: we've traditionally used either chash, hash, or random as directors in various cases. Our norm (I think 100% now) is that varnish->varnish does the shard chashing stuff (except on pass and similar, when it uses random and should), and everything else (applayer) only has a single host to pick from and uses random
[14:58:24] but something in the back of my head said "I wonder if Varnish even implements random very efficiently for the single-host case", and it doesn't
[14:58:31] :)
[14:58:40] it still runs an RNG and does some floating point calculations, even if there's only one host to pick from
[14:59:17] round_robin is slightly better, but does modulo arthemetic (not bitmask, actual modulo) even in the one-host case
[14:59:36] maybe fallback is more efficient?
[14:59:42] yeah it seems the best of the bunch
[15:00:20] s/arthemetic/arithmetic/ , apparently I'm still a coffee shy :)
[15:00:52] this is probably one of those things where there's no real practical impact anyways, but it's also trivial to fix
[15:01:24] yup!
[15:01:42] I wonder if varnish's backend-vs-director polymorphism stuff actually allows us to have no director at all in this case?
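The single-host cost difference described above (14:58:40 to 14:59:42) can be illustrated with a toy model. This is not Varnish's C implementation: the three functions below are hypothetical Python stand-ins that only mirror the kind of per-request work each director type was said to do, namely an RNG draw plus floating-point math for random, a modulo for round_robin, and a plain first-healthy scan for fallback.

```python
import random
from itertools import count


def pick_random(backends, weights, rng=random.random):
    # Weighted random pick: draws from the RNG and does float math
    # even when there is only one backend to choose from.
    total = float(sum(weights))
    target = rng() * total
    running = 0.0
    for backend, weight in zip(backends, weights):
        running += weight
        if target < running:
            return backend
    return backends[-1]


def make_round_robin(backends):
    counter = count()

    def pick():
        # Integer modulo on every request, single backend or not.
        return backends[next(counter) % len(backends)]

    return pick


def pick_fallback(backends, healthy):
    # First healthy backend wins; no RNG and no arithmetic.
    for backend in backends:
        if healthy(backend):
            return backend
    return None
```

With a one-element backend list all three return the same host; only the amount of work done to get there differs, which matches the "no real practical impact, but trivial to fix" assessment in the log.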
[15:03:22] you can just assing a backend to backend_hint without going through a director, if I understood your question correctly
[15:04:05] https://varnish-cache.org/docs/trunk/users-guide/vcl-backends.html#multiple-backends
[15:06:46] s/assing/assign/ (and I've had enough coffee)
[15:10:23] if it makes a syntax difference or complicates the templating code there's little point vs fallback, but if it's easy we could just skip the director level entirely, yeah
[15:13:04] right, if it doesn't complicate things too much I'll switch to setting the backend directly, otherwise s/random/fallback/
[15:14:32] bblack: so, unless you've noticed anything smelling particularly bad in the diff I'll go ahead and merge
[15:14:52] who knows, maybe (rng+flops)*misspass_reqs is what pushes our backend threads' efficiency over the edge and causes mailbox lag :)
[15:14:57] ema: +1
[15:18:00] https://www.ssllabs.com/ssl-pulse/ --> latest survey (July 3rd) shows a 6.8% drop in TLS 1.0 support, and a 2.9% drop in TLS 1.1 as well
[15:26:18] lol at the 62 sites that still have Heartbleed
[15:26:30] :)
[15:26:43] apparently they didn't have a giuseppe ;)
[15:27:17] !log merge alternate_domains vcl patch T164609
[15:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:21] T164609: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609
[15:28:07] vgutierrez: I haven't seen much shift (even small disturbances) in our ciphersuite explorer the past couple of AES128-SHA changes. I still suspect that data is over-smoothed when zooming into shorter ranges like 1d/1w
[15:29:20] hmmm well I was saying that based on when I looked a day or two ago
[15:29:30] now I zoom to 1w and do see some minor correlations! :)
[15:30:13] at least, a small spike and some small moves, which show it isn't totally over-smoothed
[15:35:46] it's kinda depressing.. users don't care about sec-warning :(
[15:36:07] so...
we just got lvs1015 wired :D
[15:36:23] cool
[15:36:31] and yeah, users :(
[15:36:37] users have been trained to ignore sec-warnings
[15:37:30] it's funny that the stats stay about level though, which probably implies either that some of them spam the reload button enough to make up for the others giving up, and/or most of the traffic isn't real anyways (some automaton in the background making silly requests nobody was looking at anyways)
[15:38:13] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi)
[15:38:38] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (Cmjohnson) lvs1015 idrac is setup, I think it's cabled correctly but I am not really sure, enp4s0f1 doesn't translate for me looking at h/w but I am pretty sure it matches...
[15:38:46] semi-auto wikipedia fetches that users don't really care much about are a real thing on some devices. the case I remember clearly from a past transition was older Kindle ebook readers looking up words when users highlighted them.
[15:39:20] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi)
[15:39:44] either way, if the page is replaced with a warning 100% of the time and the stats don't move much, it doesn't look like real readers who are either walking away or using another device.
[15:42:23] the alternate_domains stuff seems fine so far on pinkunicorn and cp3033
[15:42:59] I'll do some more testing elsewhere and then re-enable puppet on the remaining cache hosts
[15:44:30] be careful with test traffic. I just tried one and realized the problems it causes.
[15:44:59] I piped a request for https://phabricator.wikimedia.org/ through cp3033's text IP.
I got the generic MediaWiki no such project sort of page.
[15:45:08] < X-Cache: cp1065 pass, cp3040 miss, cp3033 pass
[15:45:24] because phab is pass-mode on 3033 now with the patch, but elsewhere it may have cached (e.g. cp3040) as now the wrong output
[15:46:21] yeah, so far I was just checking that text still worked fine
[15:46:33] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (Vgutierrez) @Cmjohnson take into account that eth0 should be enp4s0f0, not enp4s0f1 :) BTW, would you mind checking the ethernet firmware version and update them if neede...
[15:46:49] I'm getting 502 from nginx on local requests on 3033 now, maybe I'm overlapping some depool/restart of yours
[15:47:21] bblack: what type of local requests exactly?
[15:47:40] bblack@cp3033:~$ curl -v https://phabricator.wikimedia.org/ --resolve phabricator.wikimedia.org:443:91.198.174.192
[15:47:54] this is consistently giving an nginx 502 bad gateway error now
[15:48:05] (after my one request that made it through earlier)
[15:50:25] works fine with a similar request on cp1008 (using the correct text IP), but still shows a MediaWiki output
[15:50:51] uh, there's a 'varnish frontend restarted' alert on cp3033, depooling
[15:51:52] yeah I suspect my 502s (which happen after a short timeout) are crashing it
[15:52:21] I'll stop testing and leave you to it :)
[15:57:26] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi) I added the IPv6 equivalent of the v4 filter with a default "log+permit" term, so we can see if we missed anything. 3 highlig...
[16:07:01] authdns2001 synced from radon.w.o :D
[16:07:16] How many times do I have to opt-out Equinix surveys to stop receiving them?
[16:10:12] bblack: ok so, two problems:
[16:10:22] (1) certain requests managed to crash varnishd
[16:10:22] Assert error in VCL_Ref(), cache/cache_vcl.c line 296:
[16:10:25] Condition(!VCL_COLD(vcl)) not true.
[16:11:45] (2) restarting the daemon doesn't work, VCL does not compile because the wikimedia_misc label isn't available at startup yet (it is loaded by reload-vcl, but that can be done only after varnishd starts!)
[16:13:04] reverting for now
[16:14:30] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (Addshore) The WMDE scripts have requests going to the following places not via the webproxy: - https://noc.wikimedia.org/conf/dblists/...
[16:16:48] the assertion seems possibly philosophically faulty to me (referencing a cold VCL shouldn't be a code assert?), but likely it boils down to something wrong with our patterns of loading/reloading/discarding/etc
[16:17:18] but yeah, we'll probably need to get the misc VCL loaded at initial startup too :)
[16:17:30] hehe it would be nice yeah
[16:17:48] I guess it's probably just some more cli flags, hopefully they thought about this
[16:18:04] it worked on my machine (TM) though
[16:20:55] I suspect we might have to start the daemon with no vcl, then load/label the alternate, then load/use the main one
[16:22:48] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) >>! In T198623#4412494, @Addshore wrote: > The WMDE scripts have requests going to the following places not via the webproxy: >...
[16:23:26] netops, Analytics, Analytics-Kanban, Operations, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi) And more redundant, as query.wikidata.org and wikidata.org are load balanced.
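The ordering suggested at 16:20:55 (load and label the alternate VCL first, then load and use the main one) can be sketched as below. A minimal sketch under assumptions: `vcl.load`, `vcl.label` and `vcl.use` are existing Varnish CLI commands, but the config names and file paths here are hypothetical, the helper `vcl_boot_commands` is made up for illustration, and whether this sequence actually avoids the compile failure on these hosts is untested.

```python
def vcl_boot_commands(main_vcl, labelled_vcls):
    """Return Varnish CLI commands ordered so that every label exists
    before the main VCL that references it is compiled.

    main_vcl is a (config name, file path) pair.
    labelled_vcls maps label -> (config name, file path).
    All names and paths are supplied by the caller (hypothetical here).
    """
    commands = []
    for label, (config_name, path) in labelled_vcls.items():
        # Compile the alternate VCL, then attach the label to it.
        commands.append(f"vcl.load {config_name} {path}")
        commands.append(f"vcl.label {label} {config_name}")
    main_name, main_path = main_vcl
    # Only now compile the main VCL, which may reference the labels.
    commands.append(f"vcl.load {main_name} {main_path}")
    commands.append(f"vcl.use {main_name}")
    return commands


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    for cmd in vcl_boot_commands(
        main_vcl=("boot_text", "/etc/varnish/text-frontend.vcl"),
        labelled_vcls={
            "wikimedia_misc": ("boot_misc", "/etc/varnish/misc-frontend.vcl"),
        },
    ):
        print("varnishadm", cmd)
```

The point of generating the list rather than hard-coding it is that the label step always precedes the main `vcl.load`, which is the dependency problem (2) above describes.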
[16:25:09] labels don't help at all.. but you can see here https://grafana.wikimedia.org/dashboard/db/dns?orgId=1&from=now-30m&to=now&panelId=2&fullscreen baham losing DNS traffic and authdns2001 gaining it
[16:26:16] I'm going to get rid of baham.w.o from the nameservers list now :)
[16:26:18] thx XioNoX <3
[16:39:33] someday when we all have more free time, it'd be interesting to take some sniffed samples on what those 7% of reqs that are nxdomain/refused are from (some might just be dumb typos or misconfigs in our own infra, or domains markmonitor has pointed at us but we haven't bothered to even set up a parking zonefile for)
[16:40:14] bblack: noted :D
[16:41:24] bblack: tomorrow I'll reimage baham if authdns is able to survive the EU night :)
[16:41:30] *authdns2001
[16:41:34] vgutierrez: while you're making notes! :) ... https://grafana.wikimedia.org/dashboard/db/dns?orgId=1 still doesn't show units for the Y-axis, and I still think they're actually confusingly reqs/5min rather than reqs/sec. should be either converted to /sec or labeled
[16:42:06] probably we should move that dashboard to prometheus as well O:)
[16:42:11] that too :)
[16:44:56] bblack: IIRC, you want to wait till lvs1013-lvs1015 are all ready to make the switch
[16:45:18] bblack: so I'll image tomorrow lvs1015 as a spare system, and I'd check that the ethernets are behaving as expected and so on
[16:45:57] I already asked to chris/rob to update the ethernets FW there, and it's done :)
[16:46:29] so it should be as easy as massage a little bit the BIOS to set the desired config and voila :D
[16:50:24] I think, IIRC from last time this came up, the timelines for hooking up all the ports on 13 and 14 may be long, so we might want to start using 15 earlier.
[16:51:03] 15 is the one that actually should be, in the final config, primary for "low-traffic" (what we have 16 doing now)
[16:51:31] so once it's tested, when we have time, we can do another low-traffic migration from 16 to 15, and then repurpose 16 to its final role as a backup for all others.
[16:51:55] (and then we can decom all of the old ones except lvs1001 + lvs1002, pending 13 and 14 going live to replace them)
[17:08:40] Traffic, Operations, Wikimedia-General-or-Unknown: Expire cache after Wikipedia copyright protests - https://phabricator.wikimedia.org/T199252 (Nemo_bis)
[17:08:53] Traffic, Operations, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (Nemo_bis)
[17:11:28] Traffic, Operations, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (BBlack) wgCacheEpoch is probably about the parser cache, which is separate from #Traffic 's Varnish caching. Either one could be an issue here, or...
[17:19:28] Traffic, Operations, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (matmarex) The redirects (that I know of) were implemented using JavaScript code in MediaWiki:Common.js etc, for example: * https://it.wikipedia.org...
[17:58:38] Traffic, Operations, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (Daimona) Several users have reported this problem, however they weren't really redirected: instead, while searching stuff on google, //google redir...
[18:14:16] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Sec_Warning_coming_up_on_Chrome - perhaps the message needs to be tweaked to mention it's not necessarily a browser problem?
[18:18:13] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (Daimona) I think we already purged common.js several times after the blackout, anyway let's see if it works. As for Google, I...
[18:42:33] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (RobH) After discussion with @Cmjohnson its been decided we'll go ahead and attempt to get the mainboard replaced before doing the smarthands work i suggested above. @papaul was onsite and did the steps:...
[18:51:37] Traffic, Operations, ops-eqsin: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (RobH) {F23557620} Self dispatch SR971650695 scheduled, including a request for an onsite technician. Once they send me the shipping info, I'll open an inbound shipment ticket with eqsin. I'll then sch...
[18:51:51] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (Krinkle) >>! In T199252#4413493, @Daimona wrote: > As for Google, I don't have a link but I can ask for it if you want. What...
[19:42:15] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (BBlack) Was the "temporary" JS redirect a 301 perhaps?
[19:48:30] Traffic, Operations, Performance-Team, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (Krinkle) >>! In T199252#4413739, @BBlack wrote: > Was the "temporary" JS redirect a 301 perhaps? Nope, it wasn't any form of...