[06:08:49] 10Traffic, 10netops, 06Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3250599 (10elukey) Checked as well, thanks for the pointer! tcpdump -i lo doesn't show any RST for apache.
[06:21:24] 10Traffic, 06Operations, 13Patch-For-Review: prometheus-vhtcpd-stats cronspamming if vhtcpd is not running yet - https://phabricator.wikimedia.org/T157353#3250615 (10elukey)
[06:42:38] 10Traffic, 10DNS, 06Operations, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3250630 (10MoritzMuehlenhoff) p:05Triage>03High
[07:23:00] logrotate on cp1052 has been broken for a while, I'm going to rm /var/log/*gz
[07:23:10] they're old logs from November so..
[08:05:38] 10netops, 06Operations: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3250758 (10ayounsi)
[08:25:22] 10netops, 06Operations: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3250802 (10ayounsi)
[09:40:29] paravoid: thanks! :)
[09:52:23] 10Traffic, 10netops, 06Operations, 10Pybal: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3205379 (10akosiaris) Those RSTs are the result of a packet being sent to an already closed socket. In the tcpdump pasted above, nginx sends the FIN, ACK packet and then...
[10:03:46] 07HTTPS, 10Traffic, 06Operations: wikispecies.org uses an invalid security certificate - https://phabricator.wikimedia.org/T164919#3251002 (10abian)
[10:06:11] 10Domains, 07HTTPS, 10Traffic, 06Operations, 10Wikimedia-Site-requests: SSL error for https://wikispecies.org/ - https://phabricator.wikimedia.org/T164868#3251019 (10abian)
[10:06:14] 07HTTPS, 10Traffic, 06Operations: wikispecies.org uses an invalid security certificate - https://phabricator.wikimedia.org/T164919#3251021 (10abian)
[11:24:46] ema: to your comment about the upload+maps 404 clause in FE vcl... I was going to say that req_handling is supposed to handle that when there's no "default" listed
[11:25:19] ... and it does, I wasn't entirely remembering wrong, but now I realize it only sets up the 404 in the backend VCL
[11:26:10] so they'd just be 404 cache objects with our usual 10-minute 404 TTL cap
[11:26:14] (in the FE)
[11:26:30] which really isn't ideal, there's no point creating frontend 404 objects for all such requests
[11:27:41] maybe I should fix that in the vcl+ruby code, just for the case of a cluster with no "default"
[11:34:45] I started to say, it's not worth it because this will be the only case
[11:35:34] but I could see us removing "default" from the text cluster later too. Once the redirects are moved to a secure-redirect service, "default" could become an actual regex for the canonical domain names and do a quick 404 on anything else at that level.
[11:52:30] eh, can deal with this problem later. maybe s/vcl/lua/ will get there first :P
[12:08:05] refactoring the maps->upload stuff a bit (fix the above, and split up the move/delete commit into easier steps, look at some monitoring bits, etc)
[12:10:43] eh nothing can really be fixed about the monitoring easily
[12:11:11] the cluster still gets monitored, but the icinga check will be on the test upload URI only, not the maps one. I don't think the lvs-data-parsing icinga check setups allow for multiple check uris
[12:11:30] (not that we don't already have this problem on text too - we're only checking special:blankpage, not RB or cxserver URIs, etc)
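(Aside: a multi-URI probe of the kind discussed above - several URIs checked against one service address, each with its own Host header - could look roughly like the sketch below. The VIP, Host values and URIs are placeholders, it speaks plain HTTP rather than going through the TLS terminators, and it is not the actual lvs-data-parsing icinga check.)

```
#!/usr/bin/env python3
"""Minimal sketch of a multi-URI health probe against a single service IP.

Illustrative only: the IP, Host values and URIs below are placeholders,
not the real icinga/LVS check configuration.
"""
import http.client

SERVICE_IP = "208.80.154.224"  # hypothetical text-lb style VIP
CHECKS = [
    ("en.wikipedia.org", "/wiki/Special:BlankPage"),
    ("en.wikipedia.org", "/api/rest_v1/"),      # RB-style URI (example only)
    ("cxserver.wikimedia.org", "/v1"),          # cxserver-style URI (example only)
]


def probe(ip, host, uri, timeout=5):
    conn = http.client.HTTPConnection(ip, 80, timeout=timeout)
    try:
        # Send the request to the VIP but set the Host header explicitly,
        # the same trick as curl --resolve / an /etc/hosts override.
        conn.request("GET", uri, headers={"Host": host, "User-Agent": "multi-uri-check"})
        status = conn.getresponse().status
        # Treat any non-5xx answer as "backend alive"; a stricter check
        # would insist on 200 per URI.
        return status < 500
    except OSError:
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    for host, uri in CHECKS:
        ok = probe(SERVICE_IP, host, uri)
        print(f"{host}{uri}: {'OK' if ok else 'FAIL'}")
```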
[12:13:01] what's the longest (media) originals can stay in varnish?
[12:13:05] ema, bblack, I got some questions about LVS!
[12:13:26] gilles: trick question, no sane answer! :)
[12:13:38] XioNoX: ?
[12:14:07] bblack: I know the feedback loops that can happen between the varnish layers, but what's a ballpark for the worst case and the common case?
[12:14:38] bblack: basically trying to assess if I'm going to need to purge them or not due to the addition of a new header, hopefully not
[12:14:49] (header served by swift)
[12:14:59] it really depends what you mean, there's a lot of different things going on in this space, I'll try to break that down a little though:
[12:15:55] to phrase it differently, content for originals gets updated in swift, how long roughly before 99% of the copies in varnish have the new version?
[12:15:55] so, I don't think Swift provides any max-age upon which to base age calculations
[12:16:16] gilles: about 24 hours
[12:16:28] awesome, thanks
[12:16:34] ok :)
[12:16:39] no purging required at all
[12:16:44] I guess that'll make you happy
[12:16:57] is it swift making the header change, or just passing it through?
[12:17:10] objects are going to be updated in swift
[12:17:14] by a maintenant script
[12:17:17] maintenance
[12:17:18] so, this is where complications may come in with 304s, for instance
[12:17:42] is the header change going to update last-modified and make it fail an If-Modified-Since request?
[12:18:01] thumbor is going to be the consumer of the header, so 304s shouldn't matter
[12:18:02] (or alternatively, will it give a "304 Not Modified", but also send the newly-changed header with the response? that works too)
[12:18:14] it matters to varnish
[12:18:21] bblack: so I'm working on adding a new service to LVS, following https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service and godog's help (and previous patches https://gerrit.wikimedia.org/r/#/c/324371/ ) but I'm wondering how to add monitors for UDP services in hieradata/common/lvs/configuration.yaml, as well as how to distinguish UDP from TCP services in conftool-data/services/services.yaml
[12:18:33] actually now that I say it, I'm dumb, thumbor will get it from swift directly, so it doesn't matter if varnish is out of date...
[12:18:34] nevermind
[12:18:45] still would be good things to know! :)
[12:19:00] I guess it's good to know for debugging, that's all
[12:19:02] right!
[12:19:08] one of the caveats with recent TTL reductions is that 304 behavior from backends of varnish has become way more important
[12:19:39] if backends are implementing 304 poorly (e.g. lying about "Not Modified", when a fresh request would in fact return something different), varnish could potentially cache a stale object perpetually
[12:20:31] bblack: that's where I am right now: https://phabricator.wikimedia.org/P5412
[12:21:03] based on what I've seen last-modified does get updated
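(Aside: a crude way to sanity-check a backend's 304 behaviour of the sort worried about above is to fetch an object, replay the request conditionally with the validators it returned, and see whether the answer still makes sense once the object has been updated. A minimal sketch, with a placeholder URL and a hypothetical X-New-Header standing in for the header the maintenance script would add:)

```
#!/usr/bin/env python3
"""Rough sketch of a 304-sanity probe for a backend behind varnish.

The URL and header name are placeholders; the idea is just: fetch once,
replay the request conditionally with the returned validators, and check
whether a 304 still carries (or implies) the headers expected after the
object was updated.
"""
import urllib.error
import urllib.request

URL = "https://ms-fe.example.org/wikipedia/commons/x/xx/Example.jpg"  # hypothetical
HEADER = "X-New-Header"  # hypothetical header added by the maintenance script

req = urllib.request.Request(URL, method="HEAD")
with urllib.request.urlopen(req) as resp:
    last_modified = resp.headers.get("Last-Modified")
    etag = resp.headers.get("ETag")

cond = urllib.request.Request(URL, method="HEAD")
if last_modified:
    cond.add_header("If-Modified-Since", last_modified)
if etag:
    cond.add_header("If-None-Match", etag)

try:
    with urllib.request.urlopen(cond) as resp:
        # A 200 here means the validators no longer match, i.e. the backend
        # admits the object changed - the safe case for varnish.
        print(resp.status, resp.headers.get(HEADER))
except urllib.error.HTTPError as e:
    if e.code == 304:
        # A 304 is only safe for the cache if the object really is unchanged;
        # check whether the new header shows up on the 304 as well.
        print(304, e.headers.get(HEADER))
    else:
        raise
```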
[12:23:23] XioNoX: conftool doesn't know about udp-vs-tcp
[12:23:26] (doesn't need to)
[12:24:31] bblack: I'm not very familiar with conftool yet, what's the point of conftool-data/services/services.yaml exactly?
[12:26:47] the point of everything underneath conftool-data/ is just to define the members of a cluster and the service names hosted by the nodes of the cluster
[12:27:04] I'm honestly not sure why it even has "port" data
[12:27:21] ah okay, good to know, thx!
[12:27:23] (since that's functionally in the LVS config in hieradata/common/lvs/configuration)
[12:27:41] re: UDP monitoring, that's tricky
[12:28:09] ignoring all our puppet/hiera/conftool mess, it's probably useful to step back and say "How would you generically monitor a UDP service?"
[12:28:17] because there is no generic way :)
[12:28:42] the only other UDP LVS service we have is our dnsrecursors, and pybal/LVS has a special monitor for dns queries to monitor that
[12:28:47] especially one that doesn't return traffic
[12:28:57] heh, yes
[12:29:40] I guess you could send it a UDP packet and check with some timeout that you don't get some kind of error back due to no listener? but you'd have to be sure your junk packet would be silently discarded by the service too?
[12:30:13] or have logstash actually-respond to a special UDP packet that's meant to ping the service
[12:30:19] and in either case, write new pybal monitoring code
[12:30:34] godog suggested to either monitor its equivalent TCP port (possible for 11514 udp/tcp) but not for gelf (12201/udp only), or check if there is an error, but it's possible that the service is dead and doesn't return anything
[12:32:29] having logstash reply to a special udp packet would mean patching logstash, which I guess is not trivial
[12:33:45] bblack: can we have the udp service monitor the tcp port (or a different tcp port)? We can assume that if tcp is down, the whole thing is down anyway
[12:33:58] maybe
[12:34:18] for logstash that's possible yeah, its listeners are managed by the same process/jvm
[12:35:02] I'm not sure how it plays out for the LVS/pybal config
[12:35:16] we'd just have to dive into all the templating of that and/or the pybal code and see
[12:35:45] also, I'm now really curious why conftool lvs services have port numbers defined in conftool-data redundantly to the consuming LVS configs...
[12:36:08] maybe I once knew! :)
[12:39:37] godog, bblack, where can I see the list of possible monitors for pybal?
[12:39:57] anyways, the whole thing with LVS/pybal/conftool/etc configuration of services has too much organic history, probably needs some design cleanups someday
[12:40:25] https://github.com/wikimedia/PyBal/tree/master/pybal/monitors
[12:40:49] bblack: hey :)
[12:40:54] we're fine with applying the N-hit wonder stuff to maps too I assume
[12:41:46] ema: yeah I guess so :)
[12:42:16] XioNoX: there's a "runcommand" monitor that might prove useful
[12:42:43] yeah I was looking at that, there is only a /etc/pybal/runcommand/check-apache for now though
[12:43:36] yeah
[12:44:06] you could implement something there that actually ssh's to the host (like check-apache) and executes some unprivileged command
[12:46:04] bblack: is it possible to not have a monitor?
[12:46:10] probably
[12:46:33] or if it's not, you could probably define a runcommand that executes /bin/true
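(Aside: the "send it a UDP packet and see if an error comes back" idea could be prototyped roughly as below. The hostname is a placeholder and the ports are just the logstash examples mentioned above; the caveat from the discussion still applies - a listener that silently drops the probe is indistinguishable from a healthy one, which is why the TCP port ends up being the more trustworthy signal.)

```
#!/usr/bin/env python3
"""Crude liveness probe for a UDP-only service, as a sketch of the idea
discussed above (hypothetical host; ports taken from the logstash example).

A connected UDP socket can only catch an ICMP port-unreachable (nothing
listening); a dead-but-bound listener that drops the datagram looks the
same as a healthy one.
"""
import socket

HOST = "logstash1001.example.org"  # placeholder hostname
TCP_PORT = 11514                   # syslog listener, tcp+udp
UDP_PORT = 12201                   # gelf, udp only


def tcp_alive(host, port, timeout=3):
    # The more reliable proxy check: if the TCP listener in the same
    # process/jvm is down, assume the whole thing is down.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def udp_probably_alive(host, port, timeout=3):
    # Best-effort: send a datagram on a connected UDP socket and see whether
    # the kernel hands back an ICMP port-unreachable as ECONNREFUSED.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        s.send(b"")            # junk/empty probe; must be discardable by the service
        s.recv(1)              # we expect a timeout here, not an answer
    except ConnectionRefusedError:
        return False           # ICMP port unreachable: nothing bound to the port
    except socket.timeout:
        return True            # no error came back; assume (!) it is listening
    finally:
        s.close()
    return True                # the service actually replied


if __name__ == "__main__":
    print("tcp/11514:", tcp_alive(HOST, TCP_PORT))
    print("udp/12201:", udp_probably_alive(HOST, UDP_PORT))
```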
[12:49:10] I'm trying the maps VTC tests against the merged VCL and pretty much none of them passes, heh :)
[12:49:52] but it makes sense, tests assume no N-hit wonder code and we didn't use to require proper values in the Host header and such
[12:50:14] but in that case, if only that UDP service dies (very unlikely) on one of the servers, it would not fail over, correct?
[12:52:36] ema: yeah I kinda ignored the problem of the maps-cluster VTC
[12:52:55] I'm not sure if those should be ported? if they make sense to
[12:53:00] maybe only some do
[12:53:24] XioNoX: yeah, but you have the option of manually depooling via conftool at least
[12:53:48] https://mikehadlow.blogspot.com/2012/05/configuration-complexity-clock.html
[12:54:12] "I've never seen an organisation go all the way around the clock" lol, let me introduce you to VCL!
[12:54:57] haha
[12:56:43] we've actually lapped the clock
[12:57:27] we got to the initial DSL (VCL), then had config values for it ($vcl_config in puppet) and now we're at the "rules engine" stage of our second cycle around the clock with the hieradata-driven ruby-templated VCL
[12:58:02] re: maps VTC, actually none of the tests are maps-specific
[12:58:26] I think maps is simply the first cluster I've started writing tests for, so I just went nuts and tested everything there
[12:59:19] the only actual functional requirement in maps VCL is unsetting cookies, right? Which we do in upload, so \o/
[12:59:34] :)
[13:00:23] now I just need to invent a DSL for doing all the more-complex parts that the hieradata-driven rules engine of "req_handling" can't yet handle, and we'll have two full cycles of the clock :P
[13:00:54] (I guess when we ran squid in ancient days, before we switched to varnish, was the 3am "config" mark on the original clock)
[13:23:42] ema: yeah so I guess, can merge that now and we can manually test /etc/hosts hacks of maps.wikimedia.org->upload-ip and whatnot, without affecting real maps users
[13:23:57] and storage doesn't matter yet either at low req volume from manual testing
[13:25:25] on the storage front, I was considering re-doing the analysis with smaller bin ranges (e.g. 10 bins instead of 5 or whatever), but I donno, maybe now with the TTL=1d the mailbox issue has calmed down anyways?
[13:26:36] we could also just break up the worst bins maybe
[13:28:26] bblack, godog: https://gerrit.wikimedia.org/r/#/c/353064/ let me know what you think!
[13:29:00] bblack: +1 on merging the merge :)
[13:29:20] XioNoX: nice, will do!
[13:29:27] storage-wise yeah, perhaps breaking up the worst bins is enough
[13:34:16] XioNoX: first pass through found some basic issues, I didn't dig deep on the rest yet, so take it FWIW
[13:34:48] yeah, I found some too, mostly me struggling with conflict resolution
[13:38:50] thx
[13:40:42] bblack: maps->upload happening now right ?
[13:42:35] elukey: no
[13:42:51] elukey: well, the functional bit is, for manual testing (which might generate a few manual events!)
[13:42:59] elukey: the users are still using the traditional maps cluster
[13:43:40] bblack: thanks, comments fixed!
[13:43:48] sure sure no problem, I saw the +2 and asked out of curiosity.. there might be some changes to do in hadoop but nothing super urgent
[13:44:07] elukey: what kind of warning do you need?
[13:44:36] we still have a storage issue to work out before we flip users over, but I'm still wishfully-thinking I can do it this week heh
[13:44:42] bblack: just a ping when it happens, nothing more.. no blockers, it is just to clean up after the change
[13:44:51] ok awesome
[13:48:45] ema: heh I just realized an unintended consequence of how I handled the purge regexes, which is that the current maps cluster is now rejecting maps purges :)
[13:48:56] but it kinda doesn't matter because maps never did set up anything to send purges yet
[13:49:01] (I think)
[13:50:59] (yes, I was right, they're not sending purges anyways yet)
[13:57:34] 10netops, 06Operations, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3251556 (10ayounsi) 05Resolved>03Open Today the analytics hosts saturated their uplinks for about 2h, so that goes beyond a reasonable t...
[13:58:06] XioNoX: reading --^
[13:58:42] ah ok I thought it was something super bad :)
[13:59:08] :)
[13:59:24] are they 10G links?
[13:59:44] bblack: 1G https://librenms.wikimedia.org/graphs/to=1494423900/id=12189/type=port_bits/from=1494402300/
[13:59:52] I wonder if they're causing issues elsewhere? do we monitor the saturation of the inter-switch ports and such?
[13:59:56] ah ok
[14:00:12] bblack: yeah we do, no issues anywhere else
[14:00:35] bblack: another option is to add a 2nd 1G link if they need more bandwidth
[14:01:07] yeah we've done bonding before, but it's kinda ewww on the puppetization / host-config front
[14:01:19] it would be easier to upgrade them to 10g cards/ports if possible
[14:01:36] (if it's even an issue, maybe it's not worth it)
[14:03:18] upload-lb.eqiad.wikimedia.org has address 208.80.154.240
[14:03:29] bblack@alaxel:~$ curl -v https://maps.wikimedia.org/ --resolve maps.wikimedia.org:443:208.80.154.240 2>&1 | grep 'maps beta'
[14:03:32] bblack: who owns those analytics hosts, who can I ask about it?
[14:03:34] Wikimedia maps beta
[14:03:57] elukey knows some things about analytics! :)
[14:12:29] maps looks good with 91.198.174.208 maps.wikimedia.org in /etc/hosts :)
[14:12:52] yeah
[14:13:09] I'm doing some followup fixups for pedantry's sake
[14:13:19] I hadn't noticed that kartotherian is actually setting a bunch of the security-ish headers
[14:13:32] CSP / Access-Control-* / etc
[14:13:51] so I'm gonna wrap some upload conditionals around the similar header code from cache_upload to let the original maps ones through
[14:14:08] at least it will preserve more of the existing behavior. who knows if either one is fully sane at present
[14:15:56] ok
[15:40:49] https://grafana.wikimedia.org/dashboard/db/varnish-transient-storage-usage
[15:41:27] that's max by (job, layer) (varnish_sma_g_bytes{type="Transient"})
[15:42:17] interesting behavior starting on the morning of May the 5th apparently
[15:44:57] 10Traffic, 06Operations: Explicitly limit varnishd transient storage - https://phabricator.wikimedia.org/T164768#3252052 (10ema) Note that the limit cannot be set using a configuration parameter but rather by defining a storage backend named "Transient". For example: `-s Transient=malloc,1G`. See https://www...
[16:01:50] 10Traffic, 06Operations, 10Page-Previews, 06Performance-Team, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#732861 (10Tbayer) Hi, I have been trying to get up to speed on this before our meeting today. Reading through the discussion above,...
[16:17:00] ema: you mean it's max by cluster+layer, or sum by cluster+layer?
[16:17:05] oh I guess max, you're right
[16:17:15] max
[16:17:25] and yeah the numbers are crazy. this probably contributes a lot to the ??? about frontend mem sizing vs ooms
[16:17:39] yup
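(Aside: the same per-cluster figure shown on that Grafana dashboard can be pulled programmatically by running the PromQL expression above against a Prometheus HTTP API; a minimal sketch, with a placeholder Prometheus URL - only the query expression comes from the discussion above.)

```
#!/usr/bin/env python3
"""Sketch: query a Prometheus HTTP API for the transient-storage maxima
plotted on the dashboard above. The endpoint URL is a placeholder."""
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.org/api/v1/query"  # placeholder endpoint
QUERY = 'max by (job, layer) (varnish_sma_g_bytes{type="Transient"})'

url = PROMETHEUS + "?" + urllib.parse.urlencode({"query": QUERY})
with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

for series in data["data"]["result"]:
    labels = series["metric"]
    _, value = series["value"]                 # [unix_ts, "value-as-string"]
    gib = float(value) / 2**30
    print(f"{labels.get('job', '?')}/{labels.get('layer', '?')}: {gib:.1f} GiB")
```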
[16:17:58] esams text fe maxed out at 50GB of transient?
[16:18:07] that's really the units, I mean?
[16:18:24] it's bytes, yeah
[16:19:34] what happened on the 5th that could cause that kind of growth?
[16:19:49] well I have to go stare at timelines
[16:20:01] but I suspect that correlates with the restarts that bumped fe mem?
[16:20:13] possibly yeah
[16:20:21] maybe some other VCL change too, donno
[16:23:36] according to SAL we did bump frontend mem sizing on the 5th. Also varnish-be don't seem affected
[16:25:01] oh, and on the 8th we resized (hence restarted the frontends, which would explain the falls in transient storage usage)
[16:26:15] yeah I sized back down on the 8th, because ooms
[16:26:32] but, why upload+text affected similarly, etc?
[16:26:50] did you restart all the frontends after the 4.1.6 update (or other recent package updates)?
[16:27:18] I'm just wondering if the restart for the fe mem sizing was bringing in other pending changes from that...
[16:27:42] well some of the 4.1.6 upgrades and mem resizing were concurrent IIRC
[16:27:54] oh, right
[16:28:08] upload upgrades were concurrent with mem resizing
[16:28:09] the mem size bump doesn't seem like it should cause this
[16:28:41] hmm but text took off before upload
[16:28:47] oh, 4.1.6 includes the mmap fix
[16:29:00] I'm guessing right after text 4.1.6 upgrade is when text takes off, slightly earlier?
[16:29:04] (which shouldn't have anything to do with this I think)
[16:29:12] the mmap fix doesn't apply to malloc storage, right
[16:29:38] I've upgraded text to 4.1.6 on the 4th
[16:30:32] yeah seems to correlate, text starts growing earlier than upload
[16:31:27] hmmm I wonder what else changed in 4.1.6
[16:31:39] the changelog shortlist doesn't have any big red flags for this
[16:36:07] I'm not sure what VEV does, but besides that no big changes, yeah
[16:37:09] https://github.com/varnishcache/varnish-cache/commit/3c1c9703b1c2b837b9b40eebe3753f53301a545c
[16:38:47] I've gotta go, will check in later on after dinner. o/
[16:40:05] https://github.com/varnishcache/varnish-cache/commit/14ce48044a680032adc51244f58dac03c391cea1
[16:40:41] ^ possibly the max-age fix has done something to create a lot more hit-for-pass for us, due to one of our many conditionals around such things and TTL <=> 0s? I have no firm theory there, but something related is plausible
[16:41:39] there are actually a lot of such scenarios to think about, I think
[16:42:57] ultimately all they've really done there is potentially rounded some Age values downwards
[16:43:13] maybe this is taking some TTL=1s and turning them into TTL=0s and triggering hfp for us?
[16:43:42] (and maybe before, hitting a "real" TTL=0s object was a rare edge case, but hitting them at 1s-rounded-down-to-0s is much more common)
[16:44:25] also, by their nature transients should be short-lived. maybe somehow we're setting a long ttl on transients...
[16:44:58] (any chance ttl/keep changes line up with the impact here too?)
[16:45:03] I'll look more in a little while
[17:01:38] yeah I'm kind of assuming at this point it's our 7d-keep code and/or rounding some previous ttl=1 to ttl=0 (and the interaction of that with hfp-creation)
[17:02:01] maybe we're creating bad hfps, maybe we're setting 7d keep on what ends up being an hfp and that locks it into transient for 7d? no idea yet
[17:02:36] something related in wm_common_backend_response, which may or may not be exacerbated by the new TTL rounding...
[17:03:49] surely beresp.keep has no effect on hfp (beresp.uncacheable)? it wouldn't make sense, you can't really conditional-check it, right?
[17:04:20] if it does, regardless we should set a more-limited beresp.keep when we set beresp.ttl=601 for hfp maybe
[17:15:55] hmm I keep mixing up age and ttl
[17:16:05] they're floor()ing age, not ttl
[17:16:31] so it might turn some previous age=1 into age=0, but that shouldn't affect the incidence of ttl=0
[17:31:07] 10Traffic, 06Operations, 10Page-Previews, 06Performance-Team, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3252563 (10Tbayer) >>! In T70861#3206917, @Gilles wrote: > Link interaction seems like a viable candidate in the quantitative metri...
[18:09:54] robh, mr1-ulsfo crashed, could you power cycle it?
[18:10:12] hrmm, crashed and is online or offline completely?
[18:10:24] it has a single power feed, so if it's offline it's usually an indicator we lost a pdu tower
[18:10:52] icinga also shows asw-ulsfo down
[18:11:36] robh: I ran a command to clean up disk space, the router didn't like it
[18:11:42] ahhhh
[18:11:53] XioNoX: that is way better than other possible things
[18:12:09] i'll drop a smart hands request to unitedlayer to do so, since otherwise it's a half day for me to get there and back
[18:12:11] and do so
[18:12:15] (will email them now)
[18:12:34] volans: yeah, it's its management interface, going through mr1-ulsfo
[18:12:50] ok
[18:15:02] 10netops, 06DC-Ops, 06Operations: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3252768 (10ayounsi)
[18:15:15] XioNoX: email sent =]
[18:15:49] robh: opened https://phabricator.wikimedia.org/T164970
[18:15:55] thanks
[18:16:35] 10netops, 06DC-Ops, 06Operations: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3252782 (10RobH) I emailed support to reboot it via power cable removal: > Support, > > In remotely administering our mr1-ulsfo Juniper SRX100 device, it locked up and is unresponsive to our attempts to connec...
[18:16:58] it seems worth asking remote hands, otherwise it waits for me to go there which kills my afternoon of doing other things
[18:17:04] plus I'll have to spend enough time there soon enough =]
[18:17:22] leaving our mgmt router offline for potential out of band access seems bad, heh
[18:25:34] yeah totally
[18:33:58] also fyi, the Zayo link between cr2-ulsfo and cr1-codfw is down
[19:21:25] looks like ulsfo is using considerably more transient storage than codfw: https://phabricator.wikim
[19:22:13] volans: BTW, it would be nice to have an option to get that kind of output with cumin! I had to fall back to salt -otxt to generate it
[19:22:24] oh, pastefail
[19:22:25] https://phabricator.wikimedia.org/P5417
[19:23:57] varnish-frontend process runtime is comparable between e.g. cp4016 (1 day 23h) and cp2004 (1 day 20h)
[19:24:42] salt --out=txt, that is :)
[19:25:48] ema: which feature?
[19:26:22] volans: something equivalent to salt --out=txt, see https://phabricator.wikimedia.org/P5417
[19:26:52] the equivalent is in the TODO list :) for that output which command did you run? can I run it?
[19:27:07] or it's heavy/risky
[19:27:20] not at all, the command is `varnishstat -n frontend -1 | grep SMA.Transient.g_bytes`
[19:27:39] ok, thanks, I'll have a look later
[19:27:53] volans: thanks! :)
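(Aside: the per-host check could also be done in script form rather than grepping the -1 output, e.g. by reading varnishstat's JSON output and printing a salt --out=txt style "host: value" line. A sketch, assuming varnishstat 4.x JSON output and meant to be run locally on a cache host, or wrapped in ssh/cumin for a fleet-wide view:)

```
#!/usr/bin/env python3
"""Sketch of the per-host transient-storage check above, in script form.

Assumes varnishstat 4.x JSON output (-j); run locally on a cache host
(or wrap it in ssh/cumin for a fleet-wide view)."""
import json
import socket
import subprocess

COUNTER = "SMA.Transient.g_bytes"

out = subprocess.run(
    ["varnishstat", "-n", "frontend", "-j"],
    check=True, capture_output=True, text=True,
).stdout

stats = json.loads(out)
value = stats[COUNTER]["value"]
print(f"{socket.gethostname()}: {COUNTER} {value} ({value / 2**30:.1f} GiB)")
```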
[19:29:56] on testwiki and mediawiki it seems like the TLS terminators are rejecting upload POSTs that are only a few MBs
[19:30:02] has something changed today or recently?
[19:30:14] nginx returning 413 Request Entity Too Large
[19:31:02] ah, it only does it when I have the X-wikimedia-debug header
[19:31:06] I'll file a task
[19:38:10] I've added plotting of Transient.g_bytes to varnish-machine-stats: https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&panelId=58&fullscreen&from=now-30d&to=now
[19:39:23] maybe we could downgrade a text host to 4.1.5-1wm4 and compare it with another running 4.1.6?
[19:41:57] or go back to fe_mem = 0.4 * memsize, one of the two must be the culprit
[19:43:37] 0.4 on a single machine again for comparison
[19:45:56] wait a minute
[19:46:10] https://gerrit.wikimedia.org/r/#/c/343845/4/modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb
[19:46:30] first we set ttl = 0 for private objects
[19:46:42] then we cap it to 1d
[19:47:02] I think that means we're creating hfp objects and keeping them one day
[19:49:06] mmh, no :)
[19:51:22] (because we set ttl=1d only on objects with ttl>1d, and because hfps get created only for objects with ttl <= 0)
[19:59:39] re: hfp objects for conditionals, it doesn't seem to make sense, no
[20:00:23] (unless there's a bug)
[20:01:09] but anyways, we would have noticed that already with keep=3d if that were the case
[20:04:07] so the changes made roughly around the same time are: (1) upgrades to 4.1.6 (2) frontend-mem resizing (3) keep bumped from 3d to 7d and ttl lowered from 3d to 1d
[20:04:28] let's see the timelines
[20:05:49] (1) 2017-05-04 11:45 cache_text upgrades started, 2017-05-05 06:45 cache_upload upgrades started
[20:09:26] (2) 2017-05-04 13:48 `fe_mem = 0.7 * memsize - 48` 2017-05-08 15:19 `fe_mem = 0.7 * memsize - 80`
[20:10:01] (3) 2017-05-04 18:48
[20:10:19] busy day, May the 4th :)
[20:10:24] 10netops, 06DC-Ops, 06Operations: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3253261 (10RobH) a:05RobH>03ayounsi So united layer support rebooted this for us, and now @ayounsi is working on recovery.
[20:14:50] OK, gotta go again :(
[20:15:30] 4.1.5-1wm4 is on cp1008 under /var/cache/apt/archives if we want to try the downgrade-on-one-host route
[20:38:45] 10netops, 06Operations, 10ops-codfw: codfw: kubernetes200[1-4] switch port configuration - https://phabricator.wikimedia.org/T164988#3253327 (10Papaul)
[20:50:35] 10netops, 06Operations, 10ops-codfw: codfw: kubernetes200[1-4] switch port configuration - https://phabricator.wikimedia.org/T164988#3253366 (10RobH) 05Open>03Resolved a:03RobH done!
[20:53:58] 10netops, 06DC-Ops, 06Operations: mr1-ulsfo crashed - https://phabricator.wikimedia.org/T164970#3253392 (10ayounsi) Its internal storage is corrupted, @faidon re-did the steps listed on https://phabricator.wikimedia.org/T127295 And I restored the last working configuration based on rancid and jnt. Ran "reque...
[20:56:49] 10netops, 06Operations, 13Patch-For-Review: LibreNMS improvements - https://phabricator.wikimedia.org/T164911#3253405 (10ayounsi)
[21:12:54] 10Traffic, 06Operations, 10Page-Previews, 06Performance-Team, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3253457 (10Gilles) 05Open>03Resolved a:03Gilles FYI we usually link to the RAIL guidelines because they're easy to understand,...
[21:28:27] 10Traffic, 10DBA, 06Operations, 06Performance-Team: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3253496 (10aaron) The job run rate and type run rate graphs seem uninteresting in that...
[21:29:18] 10netops, 06Operations, 10ops-codfw: codfw: kubernetes200[1-4] switch port configuration - https://phabricator.wikimedia.org/T164988#3253504 (10Papaul) @RobH Thanks.
[21:34:08] 10Traffic, 10DBA, 06Operations, 06Performance-Team: Cache invalidations coming from the JobQueue are causing slowdown on masters and lag on several wikis, and impact on varnish - https://phabricator.wikimedia.org/T164173#3253516 (10jcrespo) I think this was a one-time user doing multiple purges, we can clo...
[22:57:14] 10Traffic, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 06Operations, and 3 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3253725 (10DStrine)