[00:29:56] heh and now less than 2h after it missed its daily restart, cp1074 is already starting to fall behind on its mailbox
[00:30:06] not by much yet, but you can see the pattern developing
[00:43:51] 88
[08:49:17] hello! Anything significant happened yesterday at around 2016-9-20 15/16 UTC?
[08:49:42] I have some errors with data consistency checks and I wanted to verify (didn't get any clue looking at the varnish dashboards)
[08:55:59] I can see that vk is a bit stressed in upload due to the restarts - https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=10&fullscreen
[08:56:09] so this is surely influencing the issues that I am seeing
[08:56:13] the new patch should help
[09:02:47] elukey: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes
[09:03:00] filter for site: All, cache_type: upload, status_type: 5
[09:03:53] if you go for 'Last 24 hours' you'll see a bunch of short 503 spikes, two of which happened at 15:00 and 16:00
[09:05:46] then if you go by exclusion choosing one DC at a time you can see which DC was affected
[09:05:49] esams in this case
[09:06:16] 15:27
[09:06:38] and then again esams 16:05
[09:07:11] ulsfo 15:28 16:04 and 16:06
[09:07:47] thanks :)
[09:07:50] these are very short spikes, lasting ~1 minute, the interesting part is that they happened roughly at the same time in two different DCs
[09:08:14] I tried to do the same but since there were other spikes later on I wasn't sure
[09:09:13] (I am checking the last graph of the dashboard)
[09:09:51] right, you can also use the direct link to it since it's the most interesting graph IMHO for this stuff
[09:09:55] https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=2&fullscreen
[09:11:17] nice, I'll forward it to my team
[09:11:29] weird that we have data alarms only for some spikes
[09:12:05] elukey: are the alarms about cache_upload?
[09:12:13] yeah
[09:12:17] right
[09:12:34] well vk could have crashed right
[09:12:46] and then restart again at the next puppet run
[09:13:22] I am going to do the last tests for https://gerrit.wikimedia.org/r/#/c/311415/6 then I'd like to merge, build and test it in upload
[09:13:25] what do you think?
[09:13:33] sounds like a great plan
[09:13:34] maybe not all the hosts at once
[09:13:37] :D
[09:13:38] how about Restart=always?
[09:14:04] can we keep that option if the patch doesn't work?
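(Editor's note: a minimal sketch of the Restart=always idea floated at [09:13:38], so varnishkafka would come back on crash instead of waiting for the next puppet run. It assumes the instance runs as a systemd unit named varnishkafka-webrequest.service; the drop-in path and RestartSec value are illustrative, not the actual puppetized config.)

```bash
# Hypothetical systemd drop-in; unit name and values are assumptions.
mkdir -p /etc/systemd/system/varnishkafka-webrequest.service.d
cat > /etc/systemd/system/varnishkafka-webrequest.service.d/restart.conf <<'EOF'
[Service]
Restart=always
RestartSec=5
EOF
systemctl daemon-reload
systemctl restart varnishkafka-webrequest.service
```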
[09:14:09] just to see the differences
[09:14:14] sure
[09:14:18] super
[12:58:40] those little spikes of 503 are the cron restarts
[12:59:10] they happen at roughly the same time in multiple DCs when the restarting node is in codfw or eqiad, since the fallout goes through the upper DCs as well
[12:59:50] I increased the sleeps in the backend restart script and the spikes got even smaller, but I doubt it's possible to get it perfect with reasonable sleep values
[13:12:32] oh that makes sense
[13:14:37] well another approach would be waiting for actual requests to stop coming in instead of sleeping
[13:15:28] something like varnishncsa with the right filters to exclude PURGE and /check, with a timeout
[13:16:39] not sure if it's worth the effort though :)
[13:17:45] yeah
[13:18:11] my suspicion is the depool and repool are happening plenty fast for the sleeps, but the problem is lingering long transactions
[13:18:28] (tcp connections that were already open and flowing on a long transfer for a slow client or whatever)
[13:20:45] the important thing is the little 503 spikes represent real failures, but they're tiny and by the time it fails the failing thing is depooled, so any retry by the client should succeed the second time
[13:20:59] single-digit per second for 1-2 samples or whatever
[13:47:11] bblack: varnishd's CPU usage on cp1074 is much higher than on cp1099
[13:47:33] ~230% vs. ~120% at the moment
[13:47:53] yeah
[13:48:06] it's starting to persistently backlog on mailbox now, whereas last night it was just intermittent
[13:48:53] it's ~600K mails behind, it will probably fail when it gets a bit past ~4.2M mails behind
[13:49:31] also notable for this experiment's purposes: cp1099 has fewer CPU cores and less ram than cp1074
[13:49:41] yeah I was about to say
[13:50:08] at this point I can actually randomly-catch cp1099 with small backlogs too, but they're generally <100 backlogged and tend to catch back up quickly
[13:50:26] so it does feel the same pressure, it's just "better"
[13:50:38] which sounds promising :)
[13:50:54] yeah
[13:54:55] uh, trying 4 hit wonder on text too? Exciting
[13:56:33] the v4 conditional is not particularly useful as of yet, but we're gonna need it next quarter anyways
[13:59:09] yeah I've been using it in text VCL in general, in some of the newest changes there
[13:59:18] figure a few less things to deal with on transition
[13:59:23] nice
[14:00:58] so the storage transition is slightly-tricky
[14:01:37] since the storage has to be rm'd between varnishd stop->start to get the old files out of the way, but the VCL needs to be right when it comes back up, but will fail compilation before it goes down
[14:01:45] i saw a backlog of ~6 objects earlier
[14:02:02] mark: on which?
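(Editor's note: a rough sketch of the "wait for real requests to drain" idea from [13:14:37]-[13:16:39], as an alternative to fixed sleeps in the backend restart script. The instance name, VSL query filter and timeout are assumptions for illustration, not the actual script.)

```bash
# Hypothetical drain check: succeed only if no "real" request (non-PURGE,
# non-/check) shows up in the shared memory log within the timeout.
# -n assumes the backend instance is named after the host, as on the cp* caches.
drained() {
    local timeout="${1:-30}"
    if timeout "$timeout" varnishncsa -n "$(hostname)" \
        -q 'ReqMethod ne "PURGE" and ReqURL ne "/check"' \
        | head -n 1 | grep -q .; then
        return 1   # still receiving real client traffic
    fi
    return 0       # timeout expired with no matching request: call it drained
}

# e.g. after depooling: until drained 30; do sleep 5; done; then restart the backend
```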
[14:02:06] cp1099
[14:02:19] yeah I've seen it up to ~10-20 or so, but only briefly
[14:03:30] so the transition plan to get past all that without spamming pointless puppetfails is basically:
[14:03:48] 1) Disable puppet on the node(s) being switched in the next puppet commit, and then merge up the puppet bit
[14:04:21] 2) depool->stop varnish backend, rm -f /srv/sd*/varnish*; puppet agent --enable; puppet agent -t; pool
[14:04:48] so I put them in pairs of hosts in the commits to speed things up, but I'll leave some space between the two nodes in each pair (the second way staying puppet-disabled for a little while)
[14:04:58] s/way/one/
[14:06:26] (and put some bigger time spacing between each pair-convert commit merging, so that if we do find this causes fallout days later, we have reaction time built in)
[14:07:06] the most-annoying bit is we have a travel weekend coming up
[14:07:21] and even cp1099 won't have reached 7d by then
[14:08:07] might change it so the converted nodes still have a cron restart, but on a weekly schedule (and not commit that until, say, Friday)
[14:08:19] so at least some of them will still be randomly-restarting even immediately after that, but not all
[14:08:41] which will reduce the odds of us having to deal with anything over the travel weekend
[14:09:19] after we get done with the ops offsite we can look at whether we want to keep a weekly restart for the converted nodes, or move it to something even longer experimentally
[14:09:52] (of course if cp1099 starts significantly backlogging by thurs/fri, just go ahead and turn the daily restarts back on for all, but leave the storage improvement in place)
[14:10:01] we'll see how it goes
[14:11:32] cp1074 is now at 0.248, on the backlog/n_objecthead scale, where I think it fails when it reaches slightly over 1.0
[14:12:18] I'll probably let it restart normally later today, and thus it shouldn't reach 1.0, but still get to see more comparison between the two for the rest of today
[14:12:45] ok
[14:13:10] on how many machines do you want to run the experiment?
[14:13:25] I'd like to get all of eqiad done for sure
[14:13:46] I don't know that we'll have time to safely move much more than that to the storage experiment before the travel weekend
[14:14:44] if we're lucky an all-storage-split eqiad doesn't need daily restarts, can move them out to weekly for now (and maybe longer later), and we reduce the mini-503-spikes a bit with that.
[14:14:54] yeah
[14:15:36] we still have to start using run-no-puppet for the cron restarts, but I've just noticed that using brackets in the disable message doesn't please grep
[14:15:40] https://gerrit.wikimedia.org/r/#/c/312004/
[14:16:11] hmmm there's a better way to fix that
[14:16:24] after all, there could accidentally be other metachars in $1
[14:16:42] right, [ is also a command after all
[14:16:48] grep -F
[14:17:16] brilliant
[14:18:59] tested on cp1008, works as advertised
[14:19:39] cool
[14:20:34] other news from last night: I changed https://grafana-admin.wikimedia.org/dashboard/db/varnish-caching a bit
[14:20:49] there's now a "True Hitrate" graph with lines for frontend, local, overall (which are cumulative)
[14:21:08] gets the pass/int noise out of the calculation, so we can look at effects on the true hit/miss
[14:21:33] basically frontend "true hitrate" is front-hit / (miss + local-hit + remote-hit)
[14:21:48] err...
[14:22:01] basically frontend "true hitrate" is front-hit / (miss + front-hit + local-hit + remote-hit)
[14:22:26] and then the numerator changes to "front-hit + local-hit" for local hitrate, and all 3x types of hit for overall
[15:06:51] ah cp1099 is dysprosium :)
[15:06:58] i was wondering as that seemed like an odd number
[15:07:16] yeah
[15:07:31] it seems a shame to waste it since it had basically the same hardware as the other "good" 1-gen-back nodes
[15:07:42] yup
[15:07:58] i beat the crap out of it in testing at the time
[15:08:06] like all traffic on that one server kind of thing ;)
[15:08:25] after the VCA pipes bug, it could actually do that
[15:08:29] before it, it was a great way to replicate it
[15:08:46] upload now has 10 nodes in codfw, 11 in eqiad, 12 in esams heh (and a lowly 6 in ulsfo)
[15:09:02] it was meant to be 10/10/12, but cp1099 makes eqiad 11 :)
[15:09:47] it probably makes for some nice efficiency somewhere anyways, that the chashing can't quite work the same as we cross through the DC layers
[15:12:58] hehe
[15:13:04] it's new code in varnish 4 now isn't it?
[15:13:10] not my chash
[15:13:31] yes, that made for a lovely surprise we failed to predict (but should have known)
[15:13:52] that as you roll through upgrading v3->v4, they're switching chash implementations and thus hashing to their backends completely-differently
[15:13:57] yup
[15:14:06] by the time the clients of a set of backends are 50/50 v3/v4 it basically cuts the storage in half in practice :)
[15:14:14] yes
[15:14:49] we had another related surprise like that recently
[15:15:37] where ema finally fixed a long-outstanding bug with how we name backends in VCL. The fix changed the backends' names for chashing purposes, so 1 VCL commit merges through and *poof* most of our backend storage virtually-invalidated
[15:16:43] yeah that was fun
[15:17:18] but with the frontend caches working fine and the effects of backend request coalescing, it wasn't awful
[15:17:37] as you were saying, we got both cache invalidation and naming at the same time. Just off-by-one errors were missing!
[15:19:29] well because the different chashing randomly gets the same answer for 1/N objects, where N is the count of backend caches at a DC
[15:19:41] our cache invalidation was also off-by-one from being a complete invalidation :P
[15:21:56] :)
[15:53:09] bblack: I was thinking, how about two ExecStartPost= in varnish{-frontend,}.service to chmod vsm files and restart ganglia-monitor?
[15:53:32] I always end up forgetting that problem and then clusters disappear from ganglia and so on
[15:54:01] well
[15:54:13] I think the chmod maybe belongs there
[15:54:20] I donno, it's all racy any way you do it
[15:54:37] I wish we had a clean way to fix varnish code/config for this
[15:54:44] or our uid setup, or something
[15:55:14] I think the basic problem with ExecStartPost= is I doubt the creation of the vsm is synchronous with the outer varnishd execution returning to systemd after its daemonization fork->exit
[15:55:38] maybe with a sleep built into ExecStartPost=?
but then we're getting into race hacks already
[15:56:04] as for ganglia-monitor, we should be able to express that dependency in systemd terms directly
[15:56:19] yeah, for sure the VSM files are not guaranteed to be there after service varnish start is done
[15:56:21] (as in, make ganglia-monitor's service unit depend on the varnishds in a way that forces it to restart if either of them does)
[15:58:03] I know it's far from perfect, I'm not proud of the idea :)
[15:58:30] but OTOH the current situation is also not good
[16:00:37] yeah
[16:00:50] there's a task somewhere about fixing this for at least the analytics cases, not sure about ganglia
[16:01:12] https://phabricator.wikimedia.org/T128374
[16:01:36] beware outdated thinking in the old summary
[16:02:11] we're just discussing making ganglia obsolete a goal next quarter
[16:08:14] 10Traffic, 06Operations, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2656339 (10BBlack) The proposed removal date was 2 days ago, I've just been busy with other things. Will merge removal today unless objections/alternatives as above. Ping @Krinkle . K...
[16:30:33] so, cp1074 managed to find a way to catch up, so far (which I guess makes sense, we don't really expect a failure until 4+ days out)
[16:30:44] it's now down to .01 from the previous .28
[16:30:52] (155K backlog)
[16:31:28] and backlog on cp1099 now seems less intermittent, it's showing ~1K-ish values and bouncing around
[16:33:00] so probably cp1099 will eventually get into trouble too, it's just a question of when
[16:33:17] maybe we've extended the time from "4-6 days" to "1-2 weeks", or "1 month" or who knows
[16:36:28] hackathon project: write new varnish storage backend next week
[16:38:18] * mark has become too dumb to understand varnishstat apparently
[16:38:28] varnishstat -f 'MAIN.exp_*'
[16:38:30] why doesn't that work?
[16:38:37] or even the non-glob
[16:38:53] yeah varnishstat's CLI is ugly
[16:39:22] varnishstat -1 -f 'MAIN.exp_*' '*'
[16:39:27] there's actually two fields for -f
[16:39:38] but that doesn't work when you drop the "-1"...
[16:39:53] indeed
[16:40:10] I kinda gave up and just use "varnishstat -1|grep whatever"
[16:40:33] what I've been manually periodically looking at on 1099+1074 is:
[16:40:35] indeed
[16:40:36] varnishstat -1 |egrep 'exp_|n_obj|SMF.*c_(req|free)|lru'
[16:40:53] seems to capture most of the relevant bits
[16:41:04] now > 10k behind
[16:41:38] 23kish
[16:41:40] yeah
[16:42:22] still far less than cp1074
[16:42:43] and like cp1074, it may eventually catch up a bit and reduce its number
[16:42:58] time will tell )
[16:45:43] is there a graphite graph for this? :)
[16:46:00] grafana
[16:46:13] not really
[16:46:26] if you want to get the ncurses version this should work:
[16:46:27] varnishstat -f MAIN.n_object -f MAIN.n_objectcore -f MAIN.n_objecthead -f MAIN.n_lru_nuked -f MAIN.n_lru_moved -f MAIN.n_obj_purged -f MAIN.exp_mailed -f MAIN.exp_received -f SMF.main1.c_req -f SMF.main1.c_freed -f SMF.main2.c_req -f SMF.main2.c_freed -f SMF.bigobj1.c_req -f SMF.bigobj1.c_freed -f SMF.bigobj2.c_req -f SMF.bigobj2.c_freed -f LCK.lru.creat -f LCK.lru.destroy -f LCK.lru.locks
[16:47:02] I think varnishstat output only goes to ganglia presently, right? or is there some other existing path for it into graphite data?
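(Editor's note: for reference, a sketch of how the mailbox-lag numbers quoted in this channel, and the informal "backlog/n_objecthead scale" from [14:11:32], can be sampled by hand. The counters are real varnishstat fields; the loop and the -n instance name are illustrative assumptions.)

```bash
# Sample the expiry-mailbox backlog once a minute; -n assumes the backend
# varnishd instance is named after the host, as on the cp* caches.
while sleep 60; do
    printf '%s ' "$(date -u +%FT%TZ)"
    varnishstat -1 -n "$(hostname)" | awk '
        $1 == "MAIN.exp_mailed"   { mailed = $2 }
        $1 == "MAIN.exp_received" { received = $2 }
        $1 == "MAIN.n_objecthead" { heads = $2 }
        END { backlog = mailed - received
              printf "backlog=%d ratio=%.3f\n", backlog, backlog / heads }'
done
```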
[16:47:04] ema: doesn't show me exp_mailed/etc
[16:47:11] bblack: may be yes
[16:47:20] we're just discussing making that part of a goal for next quarter
[16:47:28] yeah
[16:47:41] really killing ganglia
[16:47:46] in any case, even in ganglia it's hard to make sense of, but it can be done I *think*?
[16:47:58] not important, just curious
[16:49:14] mark: uh yeah, interesting, it's not showing MAIN.exp_mailed even though it's in the list
[16:49:20] yeah
[16:49:25] when I only had those, it didn't show anything
[16:49:34] the problem is ganglia's graphing of exp_* shows the rate, not the raw value
[16:49:40] (much less the raw value diff)
[16:49:41] right
[16:49:48] https://ganglia.wikimedia.org/latest/?title=cp1074+mailbox&vl=&x=&n=&hreg%5B%5D=cp1074&mreg%5B%5D=varnish.exp_%28mailed%7Creceived%29&gtype=line&glegend=show
[16:49:52] ^ is the rates on cp1074
[16:50:25] you can kinda tell from there that ~16:00 today is when it started persistently and significantly falling behind, though
[16:50:55] that just shows the ganglia main page for me somehow
[16:51:28] yeah
[16:51:32] it's kinda messed up :P
[16:52:05] https://ganglia.wikimedia.org/latest/graph_all_periods.php?title=mailbox&vl=&x=&n=&hreg%5B%5D=cp1074&mreg%5B%5D=varnish.MAIN.exp_.*&gtype=line&glegend=show&aggregate=1
[16:52:08] ^ try that
[16:52:14] and that URL's editable to switch hosts easily heh
[16:52:16] yep tnx
[16:52:33] looks like no ganglia data on cp1099 presently heh
[16:52:56] restarted ganglia-monitor there
[16:53:35] bblack: /var/lib/varnish/cp1099/_.vsm needs a chmod :)
[16:53:50] but you can also see where it was falling behind before and caught up very quickly at ~14:24
[16:54:02] done
[16:54:26] bblack: I've gotta go soon, tomorrow I can work on the storage transitions
[16:54:40] are you planning on starting with some hosts today?
[16:55:08] with a limiter on the data it's clearer:
[16:55:10] https://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&title=mailbox&vl=&x=500&n=&hreg[]=cp1074&mreg[]=varnish.MAIN.exp_.%2A&gtype=line&glegend=show&aggregate=1&embed=1&_=1474476862829
[16:55:46] it's not really a gradual drop in mailbox processing. it's more like mailbox processing stalls out completely for a short time, then runs at "normal" rate constantly-behind, and then catches up very quickly much later.
[16:56:18] well rates not absolutes heh
[16:56:40] the rate at which it reads the mailbox drops off sharply to a flat low value for a while, then suddenly plays catchup later
[16:56:51] but 4-6 days out, it just gets too far behind to ever catch up
[16:57:11] ema: I was planning to. at least the first of those commits for pairs of eqiad hosts -> new storage layout
[16:57:30] <_joe_> bblack: so the multiple-storage-buckets strategy is working out fine?
[16:57:45] _joe_: it's definitely an improvement, and not just in the expected/important ways
[16:57:58] but I doubt it completely fixes the problem, just makes the timescale reasonable
[16:58:09] <_joe_> bblack: well that's nice to hear
[16:58:27] <_joe_> bblack: tomorrow I'll package/distribute conftool 0.3.1, with the correct exit code
[16:58:28] instead of daily restarts to be conservative at avoiding the problem, if we're lucky it gets us to weekly restarts
[16:58:36] <_joe_> wow
[16:58:38] (if we're very lucky, longer than weekly)
[16:58:46] or restart when cron finds a big backlog?
[16:58:49] <_joe_> and what else improved?
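(Editor's note: a sketch of mark's "restart when cron finds a big backlog" suggestion at [16:58:46]. The threshold is an arbitrary example; depool/pool and the service name are taken from the transition steps quoted earlier in the log, but the whole thing is illustrative rather than the real cron job.)

```bash
# Hypothetical conditional restart: only cycle the backend when the expiry
# mailbox backlog is already large. Threshold is a placeholder.
backlog=$(varnishstat -1 -n "$(hostname)" | awk '
    $1 == "MAIN.exp_mailed"   { m = $2 }
    $1 == "MAIN.exp_received" { r = $2 }
    END { print m - r }')

if [ "$backlog" -gt 1000000 ]; then   # arbitrary example threshold
    depool
    service varnish restart           # backend instance only
    pool
fi
```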
[16:59:05] hard to coordinate maybe
[16:59:22] _joe_: we're storing ~33% more total objects than before, likely because the size-classing has eliminated huge inefficiencies/fragmentation
[16:59:31] <_joe_> bblack: heh, nice
[16:59:36] <_joe_> wow.
[16:59:47] _joe_: the code is extremely naive: when making space for a large new object, it tries to remove space of old objects _in lru order_
[17:00:03] with no regard to the size of the objects it's removing relative to the size of the new object
[17:00:06] but those objects are almost certainly completely fragmented, so before it has a large enough _consecutive_ piece of space...
[17:00:07] <_joe_> heh naive indeed.
[17:00:23] it might end up removing half the objects in storage before that happens ;p
[17:00:39] with the new scheme, the largest objects in a bin are no more than 16x the size of the smallest objects
[17:00:49] (except for the very bottom size class which is 0-16KB size range)
[17:00:56] <_joe_> that's why things like cassandra do "compactions"
[17:01:06] before we were mixing up things like 1K objects and 99MB objects :)
[17:01:27] <_joe_> bblack: what host is doing the experiment? cp1099?
[17:01:30] yes
[17:01:51] and cp1074 is the best reasonable comparison (same role/dc, and their varnishd backend start times were ~30 mins apart)
[17:02:07] but... cp1099 only has 32 cpu cores and cp1074 has 48. and cp1074 has more memory to use, too
[17:02:21] so it's kind of an unfair comparison favoring the old scheme, which I guess is the better direction to err in
[17:03:34] <_joe_> cool!
[17:03:51] <_joe_> yes the results don't look bad, but ganglia has issues on cp1099
[17:04:29] I just fixed them, I think
[17:05:14] so starting at some point today (soon) I'm going to start converting the rest of eqiad, hoping to have all eqiad on new storage before the weekend
[17:05:48] depending on how cp1099 is playing out by thurs/fri, probably either turn on a weekly-restart cron for the ones on the new storage, or just turn on the same daily cron just to be safer while traveling and onsite, etc
[17:06:16] <_joe_> I now feel confident on how to handle further problems like the weekend, btw, so on saturday I've got you covered
[17:06:41] I don't think ema and I are flying at the same time either
[17:06:42] ema and brandon travel on different days too
[17:07:00] but still it's a total PITA if any of us has to log in for hours in the midst of getting settled in and seeing barcelona for a day or whatever
[17:07:40] I'd rather be conservative if we need to be, and then experiment more with longer uptime when we get back
[17:08:11] either way, we know both storage setups are safe for 24+ hours
[17:08:52] <_joe_> btw, I should really buy my data plan for abroad now
[17:10:13] I'm gonna see how Google Fi works out (supposedly it will work out great)
[17:10:24] been using that here in the US for a few months now
[17:10:41] I believe using 1 GB of data will cost me about 50E
[17:10:45] so i'm not going to worry about it
[17:10:55] from next year onwards, it's within bundle for all of EU
[17:11:12] while in the US, it's $20/mo for unlimited calls/text, + $10/GB for data (with a very nice way of billing data, no caps/limits/other-charges for lots of usage, etc)
[17:11:56] bblack,ema: thanks to your patience I have vk 1.0.12-1 on built on copper.
Lintian and debdiff with 1.0.11-1 looks good, so I'd like to manually install it to an upload host and see how it goes (before uploading it to reprepro)
[17:12:04] <_joe_> I pay 10E and I get 500 MB of data, plus calls and sms, for a week
[17:12:06] supposedly Fi has contracts in place in ~100+ foreign countries, your phone roams there fine, uses good data speeds, and data/sms plan is the same as at home, but text calls may charge some $.XX/min long distance charges (which they inform you of on landing, and you can always do Hangouts calling over data anyways)
[17:12:07] on built == built
[17:12:10] <_joe_> it's pretty nice
[17:12:39] <_joe_> bblack: so 10$/gb?
[17:12:47] <_joe_> in the US or abroad?
[17:12:52] with Fi you can also buy additional data-only SIMs attached to the same account, that just go into your monthly billing of $10/GB
[17:13:04] _joe_: everywhere they support (US + some list of 100+ countries)
[17:13:16] <_joe_> bblack: ok my data is much cheaper at home
[17:13:24] so you can get an extra SIM for your tablet or whatever, and it's the same as if you tethered through your phone for billing
[17:13:25] <_joe_> I pay 10E for 7GB
[17:13:49] you're paying 10E == 11.15 USD for 7GB
[17:13:57] <_joe_> actually now it's 11 for 8 GB, they just swindled me a bit more
[17:13:58] I'm paying $10 for 1GB, so yeah
[17:14:22] but as far as US carriers go, it's a good deal here domestically
[17:14:24] <_joe_> bblack: yes but it's a hard cap, if I exceed 8 GB I get the 56k experience
[17:14:36] there's no cap on this one that I'm aware of
[17:14:48] <_joe_> which is good for business purposes
[17:14:52] <_joe_> ok I really gotta go now
[17:14:54] <_joe_> ttyl
[17:14:57] and you tell them how much you plan to use - e.g. set up billing as 2GB/month or whatever, but really that's just so you get nice alerts
[17:15:15] either way if you go over, they charge you the same $10/GB, and either way if you over-estimate, they refund you for your leftover data (applied to next bill)
[17:15:19] bye!
[17:15:57] * mark is off as well
[17:16:59] back on the "Fi seems awesome" thread, the other nice thing is they don't have their own towers, but they have contracts with multiple carriers in the US
[17:17:20] so you're using either T-Mobile or Sprint or US Cellular's towers, whatever's better wherever you're standing presently
[17:18:04] (which is why they limit the devices they support with Google Fi - the supported Nexus devices have radios that know how to use all the relevant networks, instead of just the freqs of particular classes of carrier)
[17:18:51] ... and then the WiFi part is pretty cool too. there's obviously no billing for data you do over WiFi, and they build in an automatic encrypted VPN when you use public networks
[17:19:05] so if you connect to a random shitty public access point, your traffic is both free and protected
[17:19:34] (protected back to google's network, anyways)
[17:20:03] * bblack Google Fi salesperson mode: off
[17:20:54] <_joe_> "Project Fi is available only for accounts in the US."
[17:21:18] that doesn't seem hard to hack around though :)
[17:21:46] I know there's some blog posts out there about full-time global-roamers using it, they just set up a US billing address and CC for the account
[17:22:25] the biggest limitation is the limited device support
[17:22:56] in theory you can move the SIM to whatever phone, but they really only "support" it with all of the supposedly-useful networks on those devices.
other devices might not have the right radios to get the best of it all.
[17:23:32] currently Nexus 6, 6P, and 5X. I'd assume the new Nexus 7 stuff coming out any day as well
[17:23:59] or whatever the next nexus is called
[17:24:27] I guess Pixel and Pixel XL
[17:25:39] MAIN.exp_mailed 29873260 192.32 Number of objects mailed to expiry thread
[17:25:42] MAIN.exp_received 29873260 192.32 Number of objects received by expiry thread
[17:25:45] ^ cp1099 did catch back up
[17:29:18] bblack: sorry to interrupt, would it be fine to manually install vk 1.0.12-1 to a cache upload host? If the test goes as expected I'll upload it tomorrow to reprepro and install it everywhere
[17:34:43] 10Traffic, 06Operations, 13Patch-For-Review: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#2656693 (10BBlack) Same data logging as back on Sep 7, but using Sept 21 data. Not much change in the overall, and still close to the same overall level (~1.46% of all requests): ``` l...
[17:36:37] elukey: sure. use a codfw upload host?
[17:37:12] e.g. 2002
[17:37:32] I already copied the deb to 3034 in esams, would it be ok ?
[17:37:48] otherwise I'll go for 2002
[17:38:29] yeah 3034 is fine
[17:39:36] Sep 21 17:39:16 cp3034 systemd[1]: Started VarnishKafka webrequest.
[17:39:36] Sep 21 17:39:16 cp3034 varnishkafka[69414]: VSLQ_Dispatch: Log acquired!
[17:39:44] seems working fine :)
[17:42:11] kafkacat on stat1002 looks good too
[17:47:14] logging off, cpu wise vk stays around 6% and it is working fine
[17:47:21] thanks again for the help today bblack!
[17:48:04] np, cya later
[18:12:24] cp1074's next scheduled daily restart is coming up in ~4.5 hours, and I'm gonna just let it happen (I suppressed the last one)
[18:13:18] the backlog is now even bigger than it was before, and we've seen all the comparison we need to know that cp1099's setup is at least "better". beyond that all we really care about is cp1099-like hosts' average time-to-failure
[20:36:47] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, 13Patch-For-Review: CN: Stop using the geoiplookup HTTPS service (always use the Cookie) - https://phabricator.wikimedia.org/T143271#2657416 (10DStrine)
[22:00:52] heh cp1074 didn't quite make it to the 2-day mark, by ~45 minutes or so
[22:05:00] I restarted it manually and disabled puppet and the cron for now, so it doesn't pointlessly-restart itself 45 minutes from now
[22:09:43] probably got triggered to go off a little faster by my depools to convert a couple of other nodes shifting a little excess miss traffic to it
[22:15:12] <_joe_> probable, yes
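(Editor's note: a hedged example of the kind of kafkacat spot check mentioned at [17:42:11], to confirm the new varnishkafka build's records are flowing and parse as JSON. The broker address, topic name and JSON field names are assumptions for illustration, not the exact commands used.)

```bash
# Consume a handful of fresh messages from the upload webrequest topic and
# show a few fields for the test host; broker/topic/fields are assumptions.
kafkacat -C -b kafka1012.eqiad.wmnet:9092 -t webrequest_upload -o end -c 1000 \
  | jq -c 'select(.hostname | test("cp3034")) | {dt, hostname, http_status}'
```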