[00:05:23] <robh>	 its hooked up to the port below
[00:05:33] <robh>	 it seems a single port isnt labeled in the initial 0-11
[00:05:40] <robh>	 but they are all enabled and in teh vlan so no one noticed
[00:05:46] <robh>	 ulsfo is a fucking mess actually =P
[00:06:00] <robh>	 I want them to replace our rack doors, they have some that add 2 inches depth to things
[00:06:09] <robh>	 we have fiber optic cable sandwhiched against the metal door
[00:06:18] <robh>	 i wanna replace all the in rack runs with DAC as well
[00:06:19] <bblack>	 ok
[00:06:29] <robh>	 since we're replacing servers, may as well right =]
[00:06:34] <bblack>	 I'll note what I find once I'm done digging, re: bios and ethernet ports and blah blah
[00:06:41] <robh>	 also if we moed the networking to the middle of the racks
[00:06:47] <robh>	 we could regain some of our power plugs
[00:06:48] <bblack>	 just trying to get 4021 up and puppetized so we can edit up the puppet stuff before the real install to fix it
[00:07:09] <robh>	 right now the pdu powwer plugs are right adjacent to our routers, which cause an issue since all the DAC/fiber going to switch and router block the power ports
[00:07:27] <robh>	 yeah, i started with cp4022, but moved to cp4021 when i had an issue
[00:07:36] <robh>	 then i realized my issue but had already moved the cables over so meh.
[00:07:57] <bblack>	 woo jessie installer starting
[00:08:06] <robh>	 bblack: So ideally we should overhaul ulsfo.  it would save me time and effort if it was all at once, but i doubt we can handle that kind of downtime.
[00:08:17] <robh>	 so it is likely a two part process, replace all the exisitng cp systems in place
[00:08:31] <robh>	 then once all the ewn stuff is spun up, and re-arranging the rack is only a single day downtime, i can do that
[00:08:40] <bblack>	 we can, it's just not ideal for asia latency and interferes with other testing/comparisons ongoing
[00:08:41] <robh>	 re-arraning will give us 5 more power outlets per rack.
[00:09:06] <robh>	 and fix the rats nest of fiber in each rack
[00:11:56] <wikibugs___>	 10Traffic, 06Operations, 10ops-ulsfo, 13Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233839 (10BBlack) ok I'm installing jessie onto cp4021 now (just to test configuration issues and patch up puppet for the real installs later!).  Things I found while trying to...
[00:12:54] <bblack>	    ┌────────────┤ [!!] Download debconf preconfiguration file ├────────────┐
[00:12:57] <bblack>	    │                                                                       │
[00:13:00] <bblack>	    │                         Malformed IP address                          │
[00:13:03] <bblack>	    │ The IP address you provided is malformed. It should be in the form    │
[00:13:06] <bblack>	    │ x.x.x.x where each 'x' is no larger than 255 (an IPv4 address), or a  │
[00:13:09] <bblack>	    │ sequence of blocks of hexadecimal digits separated by colons (an IPv6 │
[00:13:12] <bblack>	    │ address). Please try again.
[00:13:14] <bblack>	 ^ never seen that before in the installer heh
[00:21:19] <bblack>	 it might ultimately be that the installer doesn't like me using eth1
[00:21:26] <bblack>	 (due to cabling)
[00:22:44] <robh>	 oh, i fubard it?
[00:22:51] <robh>	 sorry =P
[00:23:04] <robh>	 but it shouldnt do that, it could be my ipv6 address isnt right
[00:23:09] <robh>	 and i didnt put in a reverse entry for ipv6
[00:23:19] <robh>	 but i didnt see one ofor the other ulsfo cp systems so i didnt wanna be different
[00:23:35] <robh>	 i'd yank out ipv6 entirely from dns and try again, heh
[00:23:38] <bblack>	 yeah we should fix ipv6, but that's not this
[00:23:52] <bblack>	 it is the eth1 thing :)
[00:24:08] <bblack>	 paravoid: you'll be interested since you just did all this work to stop assuming eth0 at runtime :)
[00:24:26] <robh>	 bblack: want me to drop a smarthands to ask them to move it to the other port on the system?
[00:24:32] <bblack>	 paravoid: files/autoinstall/scripts/early_command.sh has a line: IP=$(ip address show dev eth0 | egrep '^[[:space:]]+inet ' | cut -d ' ' -f 6 | cut -d '/' -f 1)
[00:24:37] <robh>	 otherwise has to wait for me to go back downt here
[00:24:44] <robh>	 not sure how much you want it online
[00:24:55] <bblack>	 robh: it can wait if you're going back sometime soon-ish
[00:25:29] <bblack>	 anyways, the TL;DR is our installation setup requires the primary (installation) interface to be eth0, and the cable is plugged into eth1 :)
[00:25:55] <bblack>	 the failure in such a scenario is kind of confusing! :)
[00:34:47] <bblack>	 robh: if you can fix it tomorrow or early friday yourself that's fine (if you have other reasons to be there!), otherwise yeah I guess smart-hands this one port swap (cp4021 to the other port on the same card)
[00:35:09] <robh>	 no other reason to be there until we wipe the older systems
[00:35:10] <bblack>	 robh: also maybe by the end of the week, I can have either 4 or 8 of the existing boxes ready for decom
[00:35:17] <robh>	 so i'll just email them to move it during ormal business hours =]
[00:35:19] <bblack>	 ok
[00:35:22] <volans>	 nice catch, we should probably implement the same logic as in modules/base/lib/facter/interface_primary.rb there too
[00:35:25] <volans>	 let me open a subtask
[00:40:11] <volans>	 {done}: T164444
[00:40:12] <stashbot>	 T164444: Installer assumes eth0 is the used interface - https://phabricator.wikimedia.org/T164444
[00:40:20] <volans>	 time to bed now :)
[00:41:02] <bblack>	 thanks!
[00:42:16] <bblack>	 robh: I fixed the ipv6 revdns for the new ones (the old ones were fine)
[00:46:22] <robh>	 cool, thx for that, sorry i missed it
[00:47:09] <wikibugs>	 10Traffic, 06Operations, 10ops-ulsfo, 13Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233902 (10BBlack) With the cable in the second port + T164444 we're blocked on getting a successful test install.  @RobH is asking smart hands to swap the cable, and I'll proce...
[00:48:55] <paravoid>	 bblack: yeah I know about that hardcoded eth0 (there are others too)
[00:49:29] <paravoid>	 that needs to be fixed if we want to do T158429
[00:49:29] <stashbot>	 T158429: Switch to predictable network interface names? - https://phabricator.wikimedia.org/T158429
[00:50:09] <paravoid>	 I kinda like catching this kind of thing though and not wiring each system randomly
[00:51:00] <robh>	 yeah so they changed where the 10g nic plugged in compared to other servers in rack
[00:51:08] <robh>	 and they arent marked, so wasnt sure which was eth0 =[
[02:06:01] <robh>	 bblack: ok eth port moved
[02:06:16] <robh>	 you should be able to reboot (update mac if you did so already) and netboot.
[02:06:48] <wikibugs>	 10Traffic, 06Operations, 10ops-ulsfo, 13Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3233964 (10RobH) >>! In T164327#3233902, @BBlack wrote: > With the cable in the second port + T164444 we're blocked on getting a successful test install.  @RobH is asking smart...
[08:42:18] <wikibugs>	 10Traffic, 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Ops Onboarding for Arzhel Younsi - https://phabricator.wikimedia.org/T162073#3234312 (10ArielGlenn) 05Open>03Resolved
[09:32:24] <wikibugs>	 10Traffic, 06Operations, 13Patch-For-Review: varnish backends start returning 503s after ~6 days uptime - https://phabricator.wikimedia.org/T145661#3234480 (10ema) Today, 2017-05-04, this issue affected 4 out of 8 of the text-esams hosts roughly at the same time, resulting in a peak of [[ https://grafana.wik...
[10:27:25] <bblack>	 ema: I'd think the thread pool max thing would be effect more than cause (threads get tied up on mailbox lag issue, which results in the high backend fetch concurrency, high thread counts, fetch failures, etc)
[10:30:30] <ema>	 yeah, they certainly do start piling up before fetches start failing. There just seems to be a correlation, timewise, between reaching thread_pool_max and the start of failed fetches (but it could be just a coincidence, they pile up very fast)
[10:30:58] <bblack>	 well yeah I'd say that's not really just a correlation, it's a direct cause
[10:31:27] <bblack>	 lockups on mailbox queue => threads lag waiting => reach maximum threads => no more fetches can succeed
[10:31:55] <ema>	 at any rate it seems clear that we should alert on something else than mailbox lag (in text the fetches start failing at a much lower level of lag compared to upload). text also seems to be able to recover, while often lag on upload doesn't go down unless we restart
[10:33:48] <bblack>	 mailbox lag does seem to be the most-leading indicator though
[10:34:04] <bblack>	 it's just that there's no easy threshold that make universal sense for when it has "run away"
[10:34:11] <ema>	 right
[10:34:34] <ema>	 we could add a check on failed fetches, which seem to be something to be aware of :)
[10:34:52] <bblack>	 well
[10:35:00] <bblack>	 failed fetches are one sub-case of things that cause 503s
[10:35:16] <bblack>	 we need better 5xx alerting than the pseudo-smart threshold alerts we have now
[10:36:31] <bblack>	 ultimately this mailbox lag problem is a "varnish sucks" problem though, not a monitoring or tuning problem
[10:37:20] <bblack>	 we shouldn't be having to deal with this at all.  open source varnish4 simply cannot reliably handle large disk cache storage because it's not well-designed for it :P
[10:40:30] <bblack>	 so, we can continue to work on mitigation tactics, I think we have to at this point
[10:40:57] <bblack>	 but strategically the answer is the same as it was a long time ago: research paths off of varnish for this purpose
[10:41:13] <ema>	 yep!
[10:41:54] <ema>	 re: mitigation, we're not binning on cache_text, perhaps that could help
[10:41:55] <bblack>	 I do increasingly think there's some kind of tie-in with TTLs though
[10:42:14] <bblack>	 (that the ramp of natural TTL expiries may influence mailbox lag behaviors)
[10:42:45] <bblack>	 upload is unique in that all objects entering the cache have identical TTLs (set by us)
[10:42:59] <bblack>	 text has a bit more random variance among the different URL types
[10:43:47] <bblack>	 even on a fresh restart, we kind of expect a mix of various TTL lengths on text
[10:44:56] <bblack>	 but upload on a fresh restart goes through a pattern where objects are initially just entering cache (no nukes, no expiry), then a phase where storage initially runs out of room and nukes start happening at all, then later a phase where it's possible for hot objects (that have not been nuked for space due to being on the correct end of the LRU list) to actually start expiring en-masse almost as q
[10:45:02] <bblack>	 uickly as they initially entered
[10:46:16] <bblack>	 something in that pattern likely exacerbates the mailbox lag problem running away from us
[10:47:34] <bblack>	 re: binning on cache_text, the problem is a lot of responses there have no content-length to bin on
[10:47:55] <bblack>	 (we don't know the length until it's done streaming through the cache, and storage has to be picked right after the initial response headers arrive)
[10:49:05] <ema>	 oh right, CL
[10:50:25] <bblack>	 also, our initial guess was that the primary help from binning was limiting the size spread in a storage file, which should reduce fragmentation and excess nuke runs to free space (e.g. having to rip through 1000+ small objects to find a free whole block for a large object)
[10:51:06] <bblack>	 but the other effect to consider is that the binning also splits up the LRU list, as that's per-file instead of global (unlike mailbox and expiry heap, which are global)
[10:52:24] <bblack>	 text doesn't even have the 64-bit SMF fixup yet right? until 4.1.6 install?
[10:52:35] <ema>	 correct
[10:53:00] <ema>	 plan was to upgrade text today
[10:53:13] <bblack>	 ok
[10:53:53] <bblack>	 maybe I should re-re-run the binning stats and try for a larger count of smaller bins, too.
[10:54:00] <bblack>	 (for upload)
[10:54:18] <ema>	 a pattern I've noticed on all text backends giving trouble today is a sudden drop in cached objects some 30-40 minutes before mailbox lag starts growing
[10:54:25] <ema>	 see e.g. https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&from=1493873642594&to=1493883650883&var-server=cp3032&var-datasource=esams%20prometheus%2Fops
[10:54:25] <bblack>	 it will either further help with fragmentation/nuking, or further scale the set of LRU lists heh
[10:54:43] <ema>	 (without a bump in obj. expiration rate)
[10:55:15] <bblack>	 sudden drop in cached objects, without a sudden expirate rate increase, should mean heavy nuking?
[10:55:21] <ema>	 0 nuking
[10:55:32] <ema>	 (that's why it seemed odd)
[10:56:43] <bblack>	 yeah
[10:56:52] <ema>	 purges?
[10:57:05] <bblack>	 no allocator failures either, which would seem to indicate (as we've thought before) that text has enough backend cache that expiries are mostly what happens
[10:57:09] <bblack>	 oh maybe
[10:57:33] <bblack>	 nothing unusual in the crazy purge pattern though
[10:57:51] <ema>	 but then you'd expect some drop in frontend objects too
[10:58:05] <ema>	 (distribution is different and all, but still)
[11:00:39] <bblack>	 the pattern is completely different from upload really
[11:00:43] <bblack>	 there's no good warning signal
[11:01:03] <bblack>	 the lag was even zero-ish until about 06:50
[11:01:17] <bblack>	 and by then fetches were already failing
[11:01:47] <ema>	 I've added the lag/fetch failure timeframes to https://phabricator.wikimedia.org/T145661#3234480
[11:01:52] <bblack>	 it's different I'd say that perhaps this is a different problem
[11:02:12] <bblack>	 it's clearly not a slow buildup of mailbox lag + locked objects leading to a slow runaway problem
[11:02:41] <bblack>	 the small immediate spike of mailbox lag here may just be one of many expected statistical effects of whatever actually happened
[11:03:45] <bblack>	 if you look at the stats through a different mindset and don't pre-suppose storage/mailbox issues
[11:03:52] <ema>	 hard to do now! :)
[11:04:05] <bblack>	 it looks a lot more like there was a genuine reason for a rash of actual backend connection hangs/failures
[11:04:17] <bblack>	 which locked up threads and objects, and of course temp-spike other stats and mailbox, etc
[11:05:40] <ema>	 also, the issues happened in the space of ~1h on 4 different backends with different varnish-be uptimes
[11:06:24] <bblack>	 right
[11:06:26] <bblack>	 4/8
[11:06:41] <ema>	 correct
[11:06:57] <bblack>	 30, 31, 40, 42
[11:07:00] <bblack>	 odd pattern
[11:07:06] <bblack>	 the others don't show fetch failures?
[11:07:21] <ema>	 30, 31, 40, 42 ?
[11:07:33] <bblack>	 oh sorry
[11:07:41] <bblack>	 31, 33, 40, 42
[11:07:43] <bblack>	 (the cp#)
[11:07:53] <bblack>	 the total set is cp30[34][0123]
[11:08:00] <ema>	 30, 32, 41, 43
[11:08:17] <bblack>	 oh, I was looking at the wrong table heh
[11:08:24] <bblack>	 those are the unaffected hosts :)
[11:08:45] <ema>	 yes! :)
[11:09:08] <ema>	 also interesting that cp3040 was not affected despite its uptime and #objects
[11:09:16] <bblack>	 well
[11:09:28] <bblack>	 I peeked at 3031 ("unaffected")
[11:09:33] <bblack>	 it does show signs, it just didn't reach fallout
[11:09:48] <bblack>	 there are spikes in client req-rate through FE+BE, spikes in threads/fetches, and even a small spike in failed fetches
[11:10:04] <bblack>	 (fe fetch-fail)
[11:10:08] <bblack>	 all around the right timeframe
[11:10:38] <ema>	 yeah, what I meant is that cp3040 not going fully belly-up despite its uptime might be a further indication that we're looking at a different problem
[11:11:13] <bblack>	 I think this was a general issue triggered by external-traffic and/or backend-fetch (to eqiad caches and/or to applayer)
[11:11:25] <bblack>	 and different caches were affected differently due to chashing pattern and/or random chance
[11:12:33] <ema>	 not the first time we see something similar on text BTW https://phabricator.wikimedia.org/T145661#3145896
[11:13:33] <bblack>	 same approx time of day, too
[11:14:56] <bblack>	 randomly off-topic: we still have the nginx upgrade pending too
[11:15:11] <ema>	 we have like 5 things pending :)
[11:16:39] <ema>	 I'd start with the text upgrades to 4.1.6 after a coffee if you agree
[11:16:47] <bblack>	 sure
[11:21:15] * ema is still puzzled by the drops in cached objects without nuking/expirations
[11:45:55] <bblack>	 ema: mind if I squeeze in the mem patch too?
[11:46:20] <bblack>	 (so some of the newer ongoing restarts begin picking it up)
[11:46:32] <bblack>	 https://gerrit.wikimedia.org/r/#/c/324230/
[11:46:37] <ema>	 bblack: +1
[11:58:45] <bblack>	 I gotta run out for a bit, I'll be back in an hour or so tops
[12:00:01] <bblack>	 ema: also note cp4021 exists now in cache_upload.  it's not in conftool data and thus not pooled or LVSed or anything, but it's there in terms of puppet roles and ipsec
[12:00:15] <bblack>	 cumin might select it, which is fine, it's just installed to test the puppetization stuff
[12:00:22] <bblack>	 (it will get shut down and/or wiped again later)
[12:00:57] <volans>	 I think we'll soon need a conftool backend in cumin ;)
[12:01:16] <bblack>	 probably :)
[12:02:21] <bblack>	 ema: other parting thoughts: the ttl=1d change may make things better or worse with upload mailbox.  also, the doubled disk storage size per node on the new ulsfo/asia boxes is likely to make it worse :)
[12:38:02] <wikibugs>	 10Traffic, 10DNS, 06Operations, 15User-fgiunchedi: Use DNS discovery record for deployment CNAME - https://phabricator.wikimedia.org/T164460#3235053 (10Zppix)
[12:39:01] <godog>	 bblack: in case it might be helpful, for the swift goal I'm running some analysis on both thumb sizes that we're storing and sizes that are requested, e.g. https://phabricator.wikimedia.org/T162796#3221138 for example to play with not storing certain sizes in cache at all
[13:03:05] <bblack>	 godog: am I reading that right, that 95% of reqs to swift are in the 0-500 byte size range (file size)?
[13:03:53] <bblack>	 oh wait, maybe that's "pixels"
[13:05:07] <bblack>	 but still, if it's up to 500 pixel count, that's images that are at most on the order of 25x20 or so
[13:05:18] <bblack>	 little tiny icons and buttons or whatever
[13:06:07] <bblack>	 ema: also I guess we'll have to pause upgrade-restarts for the services switchback at 14:30
[13:07:20] <ema>	 bblack: yup
[13:09:02] <bblack>	 I want to push the maps+upload thing "soon" too, but probably need the "auto wipe storage on restarts" thing + storage re-tweak first just to be safer
[13:09:29] <bblack>	 basically I'm trying to cram in any relatively-safe upcoming things by the end of next week (so nginx too)
[13:10:03] <bblack>	 so we can get a fairly clean (of big changes) comparison of "normal" for the week of the 15th to compare with BBR on the week of the 22nd
[13:10:27] <bblack>	 and then also the maps+upload thing helps move things along in ulsfo
[13:11:28] <bblack>	 I'm still not sure what to do about misc, as the VCL changes are a bit crazy to do in a rush there.  I had originally thought I'd just swap codfw's misc IP into ulsfo's geoip info, but that's not quite right either
[13:11:40] <bblack>	 as it would make simple DNS downtimes per-DC not work as expected
[13:12:11] <bblack>	 so maybe need a second map that has less than all datacenters (a separate map for misc, basically)
[13:12:41] <bblack>	 not having misc in ulsfo probably annoys a few other things, e.g. puppet assumptions in the ipsec module and such
[13:13:24] <godog>	 bblack: yeah 95% in 0-500 pixels range but to cache_upload not to swift, i.e. from webrequest in hive (I've clarified the comment)
[13:13:33] <bblack>	 ah ok
[13:14:36] <bblack>	 oh wait
[13:15:11] <bblack>	 "500 pixels" does not mean 20x25, it means Nx500?
[13:15:16] <bblack>	 that would make more statistical sense :)
[13:16:22] <godog>	 yeah sorry it is unclear from the task, when "pixels" is mentioned that's image width, i.e. the size we can extract from the url alone
[13:16:31] <godog>	 I'll make it more clear in the task
[13:17:27] <bblack>	 yeah so roughly speaking, 99% of reqs are for image sizes <= 1000px width
[13:17:32] <bblack>	 that seems about right
[13:17:57] <bblack>	 (and 95% <= 500px width)
[13:20:07] <godog>	 yeah, swift access logs are not in hive but we can get some days in there and see how the distribution looks like
[13:20:21] <bblack>	 godog: does swift track last-object-access in the data?
[13:20:43] <godog>	 bblack: no only last modified
[13:20:44] <bblack>	 although I guess maybe replication and other mechnical things might mess that up anyways
[13:20:47] <bblack>	 ah ok
[13:21:21] <bblack>	 and I guess it would be a ridiculously intense process to actually compare all the filenames to 90d hive list of accesses
[13:21:44] <bblack>	 so you're just going by outlier sizes 
[13:21:46] <bblack>	 ?
[13:22:29] <godog>	 no the table in hive has actually all pixel sizes for all requests, from webrequest
[13:22:47] <bblack>	 from a programmer perspective, this would be a great job for a huge bloom filter :)
[13:22:49] <godog>	 took hive about an hour to run that query and generate the able
[13:23:25] <bblack>	 insert all the 90d-accessed image filenames from hive into a bloom filter, and then rip across all the swift objects and delete if the bloom filter says they're not in the set
[13:24:18] <godog>	 hehe yeah I'd imagine that would be a nasty join for hive to do
[13:24:45] <bblack>	 I think such a bloom filter would be quite small
[13:24:50] <godog>	 OTOH it takes hive ~30s to run through all 900M thumbnail filenames
[13:24:55] <bblack>	 (as a way to abstract the problem)
[13:26:39] <godog>	 but yeah sth like that, I haven't fully thought about what would make sense to start pruning first, the list of all thumbnails as of yesterday is in hive now tho
[13:28:02] <bblack>	 e.g. if we assumed that the unique filenames accessed (from hive) in the last 90d is 100M (probably a high number), and we want the bloom filter to be 99% accurate, that's roughly a 2Mbit bloom filter (~255KB in size).
[13:28:46] <bblack>	 you'd iterate all of those filenames from hive and stuff through a bloom creator that results in this 255KB chunk of data
[13:29:22] <bblack>	 and then iterate all the swift objects using the bloom filter, and it will check each one and either definitely say "definitely not accessed in last 90d" or "accessed in last 90d with 99% probability"
[13:29:37] <bblack>	 delete everything in the first category
[13:30:24] <bblack>	 but I don't know of any great generic CLI tools for doing bloom operations like that
[13:31:27] <godog>	 good idea! yeah even coding sth up is on the cards, seems worthwhile
[13:32:33] <bblack>	 great programming idea for some bored person in the world: write some CLI-usable generic bloom filter tools, such that you can do things like:
[13:32:55] <bblack>	 cat foo |bloom --create myfilter.bloom --bits 2M
[13:33:14] <bblack>	 cat bar|bloom --test myfilter.bloom
[13:35:20] <bblack>	 hmmm: https://www.npmjs.com/package/bloomfilter
[13:35:35] <bblack>	 but it probably has bugs because it does 64-bit integer math on floats or something
[13:35:36] <godog>	 indeed, I recall we had a similar use case when drafting the plan to move from filenames to content hash in swift, but that was https://github.com/armon/bloomd
[13:35:56] <volans>	 cit. Although the bloom filter requires k hash functions, we can simulate this using only two hash functions. In fact, we cheat and get the second hash function almost for free by iterating once more on the first hash using the FNV hash algorithm.
[13:36:58] <volans>	 there are a bunch of pypi packages: https://pypi.python.org/pypi?%3Aaction=search&term=bloom&submit=search
[13:39:52] <bblack>	 yeah https://github.com/jaybaird/python-bloomfilter/ looks nice
[13:45:59] <godog>	 I'm running a query on yesterday's upload webrequest to get an idea
[13:46:00] <godog>	 INFO  : Hadoop job information for Stage-1: number of mappers: 1790; number of reducers: 1
[13:47:11] <wikibugs>	 10Traffic, 06Operations, 06Performance-Team, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3235271 (10BBlack) Interesting data on the topic of BBR under datacenter conditions (low latency 100GbE), possibly supporting the idea that it's...
[14:11:49] <wikibugs>	 10Traffic, 10Monitoring, 06Operations: Add node_exporter ipvs ipv6 support - https://phabricator.wikimedia.org/T160156#3235386 (10fgiunchedi) Issue has been fixed upstream, pending next node_exporter release or internal package build
[14:21:27] <moritzm>	 https://www.openssl.org/blog/blog/2017/05/04/tlsv1.3/
[14:22:45] <bblack>	 nice guide!
[14:23:21] <moritzm>	 but no openssl 1.1.1 in May
[14:26:32] <ema>	 I've stopped the cache_text 4.1.6 upgrades/restarts meanwhile 
[14:27:23] <ema>	 19 upgraded, 12 left to go
[14:38:16] <wikibugs>	 10netops, 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Replace old Prometheus VMs addresses with baremetal in firewall configuration - https://phabricator.wikimedia.org/T164495#3235452 (10fgiunchedi)
[14:50:56] <wikibugs>	 10netops, 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Replace old Prometheus VMs addresses with baremetal in firewall configuration - https://phabricator.wikimedia.org/T164495#3235548 (10akosiaris) 05Open>03Resolved Done. cr1-eqiad, cr2-eqiad updated and now the term is  ``` from...
[15:14:30] <bblack>	 ema: if you want to resume cache_text btw should be fine
[15:14:42] <bblack>	 (maybe you already did for all I know!)
[15:16:08] <ema>	 bblack: hihi, no! I was currently trying to parse `ss -tin` in python because apparently nobody ever did :P
[15:16:13] <ema>	 resuming he upgrades
[15:16:18] <ema>	 s/he/the/
[15:20:14] <bblack>	 lol
[15:20:20] <bblack>	 ss -tin for BBR monitoring?
[15:21:45] <ema>	 bblack: that's what prompted looking into it, yes, but it seems to be interesting information to monitor in general
[15:22:14] <bblack>	 yeah I guess as aggregates?
[15:22:22] <bblack>	 it would be crazy to send all individual connections to prometheus
[15:22:40] <ema>	 right yeah godog knows where I live
[15:22:58] <bblack>	 :)
[15:23:09] <bblack>	 also I think the fields will change per congctl
[15:23:29] <ema>	 I was thinking of averages perhaps? Although it might be interesting to distinguish between our private WAN and the public Internet, among other things
[15:24:08] <bblack>	 does netstat already have versions of some of these stats averaged over all tcp conns?
[15:24:40] <ema>	 netstat -s / nstat you mean?
[15:25:23] <bblack>	 yeah
[15:25:27] <bblack>	 I guess not avg RTT and such
[15:25:40] <ema>	 not AFAIK
[15:25:48] <bblack>	 but really our RTT won't shift much from a congestion control change
[15:25:56] <godog>	 hehe we can provision your own 'traffic' instance of prometheus and you can go nuts on that :)
[15:26:11] <bblack>	 I'm trying to think what *would* be useful to monitor from ss output on bbr-vs-cubic
[15:26:12] <godog>	 but yeah it'll likely be too much cardinality to track client addresses
[15:26:33] <bblack>	 average pacing_rate?
[15:27:25] <ema>	 there's no pacing_rate in bbr's ss output
[15:27:28] <bblack>	 ah
[15:28:08] <ema>	 lots of other cool stuff, like the estimated bandwidth, but no pacing_rate
[15:28:51] <bblack>	 so we may not get much direct comparison from connection stats anyways?
[15:29:15] <ema>	 we'd want to monitor throughput mainly, I imagine
[15:30:17] <bblack>	 I guess what I mean is: is there a meaningful measure of something like "throughput" that exists in equivalent/comparable forms in both cubic+bbr stats?
[15:30:43] <ema>	 "send" perhaps, which I think should be the amount of data sent on that socket
[15:31:18] <ema>	 so the amount of data sent over "comparable sockets" in a certain amount of time
[15:31:30] <bblack>	 :)
[15:31:34] <ema>	 :)
[15:32:17] <bblack>	 yeah but since connections are short-ish (at least, ones filled with data)
[15:32:30] <bblack>	 once you average out to even a minute or two they'll look basically-identical
[15:33:11] <bblack>	 you'd expect if BBRs making a difference you could see in "send", it would be that the same 2.3MB of data was sent to the client in the first 200ms of the connection instead of 350ms or whatever
[15:33:43] <bblack>	 they'll all reach the same number eventually, and "eventually" is rather soon for timescales we can monitor effectively and average/compare
[15:34:20] <ema>	 yeah
[15:34:32] <bblack>	 and then there's the fact that lots of our TCP spends lots of time idle regardless
[15:34:49] <bblack>	 e.g. http/2 keepalive conns sitting around after the first (and commonly only) pageview, etc
[15:38:56] <bblack>	 ema: I'm gonna try nginx upgrade on one of the live cache_misc nodes and go from there, maybe try to get maps+misc done today
[15:39:12] <ema>	 bblack: cool
[15:55:24] <ema>	 bblack: I've merged the BBR patch and changed cp1008's queue discipline for eth0 to fq in case we want to go crazy with iperf and friends
[15:56:12] <bblack>	 ok
[15:56:37] <ema>	 CC: moritzm ^
[15:59:06] <bblack>	 ema: we might need to consider some FQ tuning for the caches' case too
[16:00:03] <bblack>	 the ones I worry about are the defaults for that might need scaling for high connection concurrency are: limit=10000 and buckets=1024
[16:00:39] <bblack>	 buckets=1024 might be low (but who knows) for connection counts we commonly see, to sort out all the flows efficiently
[16:01:08] <bblack>	 and I worry about the limit=10K (packets enqueued in fq) being small for so many conns, too, seeing as the default per-flow hard cap is 100
[16:01:36] <bblack>	 but I donno, maybe the defaults are fine because fq shouldn't be bloating up that much anyways as it enqueues to the network card
[16:01:59] <bblack>	 (at least limit might be fine, maybe not buckets? should limit maybe be close to our txqueuelen?)
[16:03:02] <bblack>	 hmm I guess even our eth0 txqueuelen is actually still at defaults on cp*, we only tuned it upwards on lvs
[16:03:14] <bblack>	 (to 10-20K)
[16:03:18] <bblack>	 default is 1k
[16:03:20] <ema>	 re: limit, we can just monitor dropped packets and tune if necessary I guess?
[16:03:28] <bblack>	 maybe?
[16:03:39] <ema>	 :)
[16:03:53] <bblack>	 where/how does fq drop packets due to "limit" statswise?
[16:04:11] <ema>	 according to man tc-fq:
[16:04:20] <ema>	    limit
[16:04:20] <ema>	        Hard  limit  on  the  real  queue  size.  When this limit is reached, new packets are
[16:04:23] <ema>	        dropped. If the value is lowered, packets are dropped so that the new limit  is  met.
[16:04:26] <ema>	        Default is 10000 packets.
[16:04:32] <ema>	 where and how, I have no clue
[16:04:42] <bblack>	 oh it's shown in the example
[16:04:55] <ema>	 ah cool!
[16:05:36] <ema>	 that is a good metric to put in prometheus
[16:05:59] <ema>	 dropped / overlimits / requeues
[16:06:43] <bblack>	 root@cp1008:~# tc -s -d qdisc show dev eth0|grep dropped; // Sent 10059187 bytes 76566 pkt (dropped 0, overlimits 0 requeues 3) 
[16:06:59] <bblack>	 yeah
[16:08:33] <bblack>	 aside from the public-side effect on client conns
[16:08:54] <bblack>	 I'm betting it helps with our miss/pass connections from the edges back to the core too
[16:09:50] <ema>	 I think we should have backend response time data in graphite to look at
[16:10:20] <bblack>	 send 641.8Mbps pacing_rate 1946.1Mbps
[16:10:27] <bblack>	 I was like "wtf could be sending that on cp1008"
[16:10:31] <bblack>	 kafka of course :P
[16:10:59] <bblack>	 ema: +1 on backend response time, but it really needs to be per-backend
[16:11:52] <bblack>	 kinda like the per-backend response codes stuff that's in grafana now, but is horrible/hacky
[16:12:10] <bblack>	 https://grafana.wikimedia.org/dashboard/db/experimental-backend-5xx?orgId=1
[16:16:59] <ema>	 bblack: yeah so we have the data but we don't have any nice dashboard AFAIK https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1493914579.921&target=varnish.eqiad.backends.be_cp2004.GET.count&target=varnish.eqiad.backends.be_cp2004.GET.mean&target=varnish.eqiad.backends.be_cp2003_codfw_wmnet.GET.count&target=varnish.eqiad.backends.be_cp2003_codfw_wmnet.GET.mean
[16:19:10] <ema>	 count is useless there of course, the metric is GET response time in ms with cp2003 as a backend: https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1493914707.82&target=varnish.eqiad.backends.be_cp2003_codfw_wmnet.GET.mean
[16:19:31] <bblack>	 yeah
[16:19:50] <bblack>	 how to make that pretty for all the potential uses is tricky (both the per-be error codes and per-be timings)
[16:21:41] <ema>	 we just need a graph showing our CDN topology with edges between the nodes that you can click on and get all the performance metrics you might need to care about
[16:21:47] <ema>	 :)
[16:24:51] <ema>	 5 text nodes left to upgrade
[16:24:54] <bblack>	 yeah and a VR flythrough as we drill into it right?
[16:25:22] <ema>	 like IRIX's 3D file manager, the one from Jurassic Park
[16:25:55] <bblack>	 https://www.youtube.com/watch?v=J1VE6C0H2bU
[16:26:05] <bblack>	 yeah that's what I was just digging up a link for lol
[16:35:36] <ema>	 bblack: should we merge the final ttl/keep swap commit? 
[16:35:56] <ema>	 so that next week doesn't have many changes with an impact on performance for BBR evaluation
[16:36:30] <bblack>	 we still have next week to go too
[16:36:51] <bblack>	 basically May 15 week is "no major changes that impact BBR eval" and May 22 week is "turn on BBR and compare to May 15 week"
[16:36:59] <ema>	 oh right! 
[16:38:12] <bblack>	 ema: but that being said, today's a decent day to do the ttl/keep switch, yes
[16:38:32] <bblack>	 if it's going to have a negative impact on hitrates, well, it won't be so awful we can't live with it for a weeke
[16:38:35] <bblack>	 nd
[16:39:03] <bblack>	 if it's going to have a negative impact on the mailbox stuff about 24h in, at least that will have it happening just before we all leave for the weekend and we can revert
[16:39:24] <bblack>	 (I don't think it should, since expiry shouldn't happen until ttl+keep, but still)
[16:40:31] <bblack>	 the other stuff I want to push through before May 15 is finishing nginx upgrades, and merging maps into upload (assuming no snags on that)
[16:40:53] <bblack>	 and the latter semi-blocked on re-shuffling the storage binning and deploy some kind of wipe-on-restart
[16:41:12] <bblack>	 (and of course finish the 4.1.6 upgrades too)
[16:41:25] <ema>	 bblack: could you please double-check the patch? Seems simple enough, but this being VCL... https://gerrit.wikimedia.org/r/#/c/343845/
[16:42:45] <bblack>	 ema: was there a reason for the re-ordering or just arbitrary?
[16:43:09] <bblack>	 (the re-ordering of setting ttl then grace then keep, vs now keep then ttl then grace)
[16:43:28] <ema>	 yes there was a great reason that I'm trying to remember
[16:43:42] <bblack>	 because setting one of them re-sets another one internally?
[16:44:16] <ema>	 oh because we use beresp.ttl as a cap for keep
[16:44:20] <bblack>	 oh, right :)
[16:44:37] <bblack>	 which isn't really exactly right either, but it's a step in the right direction for now
[16:45:32] <bblack>	 ideally we trust that apps don't violate IMS and just unilaterally set keep to 7d, but meh for now
[16:46:05] <bblack>	 we can fix that when we get into the next iterations of related work
[16:46:35] <bblack>	 (about having a longer grace inside of the original TTL, and about aligning the TTLs cross-layer using surrogate-control and all that)
[16:48:08] <bblack>	 lgtm!
[16:48:12] <ema>	 yay!
[16:48:58] <bblack>	 I think the longer-grace-inside-TTL has the potential to improve our hitrate too
[16:49:24] <bblack>	 well at least improve the p99 part of it so to speak
[16:49:37] <bblack>	 it should make it more common to refresh semi-hot content async instead of sync
[16:54:07] <ema>	 and perhaps we'll see inter-DC link usage going down even :)
[17:05:43] <bblack>	 yeah from keep too :)
[17:19:28] <ema>	 cache_text upgraded to 4.1.6. See you tomorrow!
[17:26:30] <bblack>	 cya
[18:05:44] <wikibugs>	 10Traffic, 06Operations: Build nginx without image filter support - https://phabricator.wikimedia.org/T164456#3236376 (10BBlack) Yeah, @Faidon has brought up a similar argument before on a slightly different level: that we shouldn't be using nginx-full on most hosts anyways, since we use virtually none of the...