[01:14:19] it could periodically write the facts to a path and that just doesn't happen if it can't access netbox [01:14:48] (i mean, some external mechanism could, into a path that the puppet master looks at) [04:00:03] I think that's the best option [04:00:12] and add a simple alert if the file is too stale [04:54:20] robh: dbprov1001 (A5) and dbproxy1012 (A7) are having PS redundancy alert, can you guys take a look, should I create a task? [09:00:29] mutante: re:db with scap, you can easily check on puppetboard (catalog tab) if that comes from puppet (as it does), just go to https://puppetboard.wikimedia.org/catalog/db1107.eqiad.wmnet and put "scap" in the filter at the top ;) [09:03:44] paravoid: re:how to expose rack to cumin. If that's the only problem it will not be anymore a problem very soon, the Netbox backend for cumin is in the final stage of review, soon to be merged and will allow to things like 'P{normal_puppdb_query} and N{site:eqiad and rack:A1}' [09:14:12] catched up with backlog now /o\ [11:26:28] marostegui: dbprov1001 had one yesterday but i fixed it [11:26:37] go ahead and make a task for them please if they are still in alert [11:26:43] odd it has one now [11:33:48] robh yeah but it showed up again [11:34:18] maybe loose cable? [11:41:07] perhaps, will check later today =] [11:43:19] you want me to make the task? [11:50:46] sorry, was eating breakfast [11:50:57] if you dont i will [11:55:11] no worries, I will create one for both hosts [12:05:33] thx =] [12:05:41] headed in shortly, just drinking coffee and playing email catchup. [12:07:33] robh: I assigned it to you directly: https://phabricator.wikimedia.org/T228859 [12:07:39] thx [12:07:44] thanks! [12:11:55] so, what are we doing with yesterday's incident pad? [12:12:03] is someone creating an incident report on-wiki and filing tasks? [12:12:21] also, are we considering the logstash cascading failure part of the same incident or a separate one? [12:12:44] I guess I should be asking the incident coordinator? who was that? [12:13:16] _joe_: I was told at some point that was you, not sure if that's true though :) [12:44:55] btw Arzhel found this: https://www.irccloud.com/pastebin/zrdKmtMC/ [12:45:38] just a few CRC errors... [12:46:19] and fragments [12:46:57] and doesn't say a time period, i.e. when it was last reset [12:47:26] yeah but I bet if you looked at the other links in the virtual chassis they'd be 0 or close to it as well [12:47:40] yeah [12:47:45] both of CRC errors and [Ethernet frame?] fragments will get dropped on reception, right? [12:47:50] yes [12:50:07] <_joe_> paravoid: yeah I was, and I would consider the logstash failure a consequence of the outage, so part of the same incident [12:51:32] ok [12:51:49] so I think cdanis did an initial pass at a doc [12:52:06] but now someone needs to clean this up, post it on wiki, file tasks [12:52:23] should that be the IC, if am understanding the new process right? [12:53:24] clean this up = check for accuracy and completeness based on what happened + format [12:54:56] <_joe_> yes [12:55:08] <_joe_> I'm happy to do it :) [12:55:33] heh [12:55:38] but I do understand the process right, right? 
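The "write the facts to a path, and alert if the file is too stale" idea from 01:14/04:00 above is simple to sketch. Below is a minimal example of the staleness-check half only; the export path and thresholds are made up, and the exit codes just follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL):

```python
#!/usr/bin/env python3
"""Minimal staleness check for a periodically-exported facts file (sketch)."""
import os
import sys
import time

FACTS_PATH = "/srv/netbox-facts/facts.yaml"   # hypothetical export path
WARN_AGE = 2 * 3600                           # warn after 2h without a refresh
CRIT_AGE = 6 * 3600                           # go critical after 6h

def main() -> int:
    try:
        age = time.time() - os.stat(FACTS_PATH).st_mtime
    except FileNotFoundError:
        print(f"CRITICAL: {FACTS_PATH} does not exist")
        return 2
    if age >= CRIT_AGE:
        print(f"CRITICAL: {FACTS_PATH} is {age:.0f}s old (>= {CRIT_AGE}s)")
        return 2
    if age >= WARN_AGE:
        print(f"WARNING: {FACTS_PATH} is {age:.0f}s old (>= {WARN_AGE}s)")
        return 1
    print(f"OK: {FACTS_PATH} is {age:.0f}s old")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The exporter side can keep writing atomically (write to a temp file, then rename); if it can't reach Netbox it simply skips the write and this check eventually fires.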
:) [12:55:38] <_joe_> but I'll need help on what you did find on the network side if needed [12:55:45] of course [12:55:50] technically I don't think we formalized the aftermath process [12:56:03] (but I think this makes sense) [12:56:19] <_joe_> cdanis: I don't think either, but 1 - makes sense 2 - I'm happy to review that [12:56:28] <_joe_> given you are busy reviewing my mtail patch [12:56:57] I don't think the pad has enough about the logstash fallout, so that needs more data too probably [12:57:11] _joe_: ahaha [12:58:00] <_joe_> ok, I didn't follow the aftermath there, so we will probably need to ask shdubsh to fill in the gaps [12:58:16] nod [12:58:22] here's another thing we should look into: why some packet loss caused so many analytics retransmits (and thus an overall increase in network traffic of something like 35Gbps) [12:58:36] I'll make sure there are links and a question on the pad [12:58:58] <_joe_> cdanis: I don't think we can properly quantify "some" [12:59:03] where are such pad? [12:59:11] i can review it as well as an outlier [12:59:20] wasnt involved in the incident [12:59:36] (I do have a hunch that analytics traffic was disproportionately affected, as I'd expect that most of their traffic is disk read RPCs that are at the MTU size, so assuming BER stays the same, more of their packets will be lost) [12:59:44] https://etherpad.wikimedia.org/p/2019-07-23-eqiad-asw-a fsero [13:03:39] there is an overarching question that I added [13:03:43] this is curious, there was a spike in database reads at the *end* of the memcached suckage: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1563905855772&to=1563918298168 [13:03:46] which is probably going to be hard to pin down [13:04:01] and that is, why is 4-7% packet loss enough to cause mayhem [13:04:03] there is an increase in connections to mysql for the duration of the incident, which makes some sense [13:04:07] on otherwise tcp flows :) [13:04:34] cdanis: yeah, I was monitoring those during the incident [13:05:59] <_joe_> cdanis: that's not completely strange. [13:06:50] perhaps requests were long-running and piled up on the appservers (blocking on memcached/mysql/other network calls) and then all started completing at once? [13:07:20] <_joe_> cdanis: no I rather think we suddenly were able to store data on memcached, and to retreive it, and find it stale [13:07:30] mmm [13:08:13] <_joe_> depending on how you use memcached, that can cause more data to be read [13:08:31] <_joe_> but I have to correlate the data and the code in mediawiki, which is not always straightforward to do [13:09:39] <_joe_> the two spikes have no correspondence [13:09:43] <_joe_> so I would rather say [13:09:59] <_joe_> it was higher throughput between appservers and databases [13:10:10] I guess... hm. mcrouter can't talk to a given memcached, it TKOs it, it re-hashes that key to another server it thinks it can talk to... then at the end once the original server is healthy again, it will find it stale, need to talk to the database [13:11:05] <_joe_> the outage ended at 20:09, the mysql spike ended at 20:04 [13:12:29] the beginning of the plummet in Analytics network traffic/retransmits is at 20:04 though [13:13:04] <_joe_> uhm, I'm pretty sure paravoid did issue the command at 20:08-something [13:13:09] indeed [13:13:11] as am I [13:13:16] so I think there was something else going on [13:13:27] I !logged it [13:13:43] yep, at 20:08 [13:13:47] elukey: is there any history of what Hadoop jobs were running at what times? 
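cdanis's hunch at 12:59:36 (MTU-sized analytics frames are disproportionately likely to be hit by bit errors) and the 13:04:01 question (why 4-7% loss causes mayhem on otherwise healthy TCP flows) can both be put into rough numbers. A back-of-the-envelope sketch, assuming independent bit errors and the classic Mathis et al. approximation for loss-limited TCP throughput; the BER, MSS and RTT values below are invented for illustration:

```python
import math

def p_frame_corrupted(frame_bytes: int, ber: float) -> float:
    # Chance at least one bit in the frame is flipped, assuming independent
    # bit errors at rate `ber`, so MTU-sized frames are far more exposed
    # than small ones.
    return 1.0 - (1.0 - ber) ** (frame_bytes * 8)

def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss: float) -> float:
    # Mathis et al. approximation: rate <= (MSS / RTT) * C / sqrt(p),
    # with C ~= 1.22. Crude, but it shows why a few percent loss is
    # enough to flatten individual flows.
    return (mss_bytes * 8 / rtt_s) * 1.22 / math.sqrt(loss)

ber = 1e-7  # made-up bit error rate for the sick link
for size in (100, 576, 1500):
    print(f"{size:>5}-byte frame: P(corrupted) = {p_frame_corrupted(size, ber):.3%}")

for loss in (0.01, 0.04, 0.07):
    mbps = mathis_throughput_bps(1448, 0.0005, loss) / 1e6
    print(f"{loss:.0%} loss: ~{mbps:.0f} Mbit/s per flow (MSS 1448, RTT 0.5 ms)")
```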
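The mechanism _joe_ and cdanis talk through at 13:07-13:10 (a TKO'd memcached gets its keys routed elsewhere, and once it is healthy again its old values are found stale, pushing reads back to MySQL) can be reproduced in miniature. This is a toy model of that behaviour only, not mcrouter's actual routing code, and the server names are invented:

```python
import zlib

class ToyMcPool:
    # Toy model: keys hash to a primary memcached; while that primary is
    # marked TKO, its reads and writes go to a shared gutter host instead.
    def __init__(self, primaries):
        self.primaries = {name: {} for name in primaries}
        self.gutter = {}
        self.tko = set()

    def owner(self, key):
        names = sorted(self.primaries)
        return names[zlib.crc32(key.encode()) % len(names)]

    def backend(self, key):
        o = self.owner(key)
        return self.gutter if o in self.tko else self.primaries[o]

    def set(self, key, value):
        self.backend(key)[key] = value

    def get(self, key):
        return self.backend(key).get(key)

pool = ToyMcPool(["mc1019", "mc1020"])   # invented server names
key = "user:42:prefs"
pool.set(key, "v1")                      # stored on the key's primary

pool.tko.add(pool.owner(key))            # packet loss: primary marked TKO
pool.set(key, "v2")                      # update only ever reaches the gutter

pool.tko.discard(pool.owner(key))        # link fixed, primary back in rotation
print(pool.get(key))                     # -> "v1": stale, so callers fall back
                                         #    to MySQL, hence the read spike
```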
[13:15:24] cdanis: hello :) yes in https://yarn.wikimedia.org/cluster but the backlog is not kept for long (since it stores things in zookeeper) [13:18:17] mmm this is a lot to sift through [13:20:45] okay, there's no single job that correlates well with the big spike in retransmits (from ~19:28-20:04) [13:33:49] robh: nitpicking: I've noticed you've been renaming devices to "WMFnnnn" -- but I think lowercase for device names is what we've been doing so far [13:34:00] (asset tags are in capital case though!) [13:34:09] sorry for the OCDing :) [13:36:35] robh: also, the old PDUs were renamed to WMFnnnn but are still set to Status: Active (which surfaces them as errors) [13:36:58] i havent worked on reports this week [13:37:09] i figured i could get a pass and fix next [13:37:11] but will fix now [13:37:22] well, after im done on cloudvirt1015 [13:37:32] cdanis: i think incident pad is ok, is clear enough and have all the needed data [13:38:25] just to be a little picky as well, i got the page late and i was missing such document during the incident i didnt know how to help and lot of people were already hands on deck so i guess for the future one having the pad is important for coordination [13:38:42] besides that nicely done everyone :) [13:38:46] paravoid: i thought we wanted to convert all the lowercase to uppercase but i suppose we only cared about the asset tag field not the hostname when used with asset tag [13:38:52] so i'll adjust accordingly [13:38:59] but i prefer if we went so they matched [13:39:03] ie: all uppercase [13:39:05] robh: yes, all asset tags uppercase, all device names lowercase [13:40:00] that's not unreasonable but maybe we can think about it in some later iteration -- the DNS generation project is going to be entangled with this perhaps [13:44:04] fsero: yeah, it's a good point, cc: _joe_ :) [13:44:38] there is already WIP for that but yeah the idea is that to add the pad/doc to the topic of the chan during incidents [13:44:47] <_joe_> paravoid: already noted [13:45:09] ack, thanks guys [13:58:00] fwiw: technically DNS is supposed to be case-insensitive, but the protocol actually can/does carry the case information, it's just "standard" that all interfaces to the data are meant to match lookups case-insensitively [13:59:04] 🤯 [13:59:21] what gdnsd does (which is within the scope of what standards allow) is it converts all names from zonefiles to lowercase internally, and matches against incoming queries case-insensitively, and returns output names that match the case as-queried (in the case the output is the name that was queried), or all-lowercase otherwise. [14:00:38] there was once a draft-standard to increase UDP DNS resiliency against forgery, to use the case bits of the query name as additional ID bits (meaning if a client queried a 23-character name, it could get 23 bits of additional ID protection by using the case bits of those characters, making them look random and mixed), which only works if servers preserve query name case like gdnsd does above. [14:00:54] it never made RFC status, but I think there are some implementations in the wild anyways, seems better to conform there. [14:02:25] yeah it's funny to see with dig +trace the various behaviours :) [14:02:40] like google respecting casing while amazon not [14:03:39] root servers too preserve it AFAICT [14:04:14] yeah the "use the case bits for ID" proposal made a lot of sense. I think what it got hung up was the minority non-standard implementations which failed to preserve. 
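The draft bblack describes at 14:00:38 (often referred to as "DNS 0x20") needs only two small pieces on the client side: randomize the case of the outgoing query name, then require the response to echo it byte-for-byte. A rough sketch of the idea, not any particular resolver's implementation:

```python
import random

def encode_0x20(qname: str, rng: random.Random) -> str:
    # Randomize the case of every ASCII letter in the query name; on a
    # 23-letter name this adds up to 23 bits an off-path forger must guess.
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in qname)

def response_acceptable(sent_qname: str, echoed_qname: str) -> bool:
    # Only works against servers that preserve the query's case exactly
    # (as gdnsd does, per the description above); a response whose echoed
    # name differs in case is treated as a likely forgery.
    return sent_qname == echoed_qname

rng = random.SystemRandom()
sent = encode_0x20("en.wikipedia.org", rng)      # e.g. "eN.WIkipEdia.oRg"
print(response_acceptable(sent, sent))           # True
print(response_acceptable(sent, sent.lower()))   # False (unless the random
                                                 # encoding was all-lowercase)
```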
[14:05:18] really we can peel this onion back a further layer: the names in DNS are actually fully binary-capable at the protocol level (you can encode any 8-bit bytes inside of DNS labels). [14:05:54] it's really just that the matching algorithm says to ignore ASCII case on ASCII alphas which screws that up, making the matching non transparent for arbitrary binary data. [14:06:05] ahahah [14:06:40] if it weren't for that, we could've put UTF-8 directly in DNS names, instead of the ridiculous IDN scheme we eventually got with the ascii-armored xn--whatever names.e [14:07:40] I think you could teach a semester course on protocols and standards and how things go wrong over the long scale, using just the DNS RFCs to draw course material from. [14:07:45] ahaha [14:13:25] I think we could teach a couple years' worth of courses on character set hell all by itself... [14:13:26] paravoid: fixed the report errors [14:15:16] robh: I'm online earlier than expected, ready to work on the fiber when you are, see https://phabricator.wikimedia.org/T228823 [14:29:42] ahh, gimme about 15 min, need to finish up mid task work [14:29:47] and then move back to dc floor from break area [14:32:11] also chris is here so i wanna chat with him about it, since he knows where the spare fibers are =] [14:39:50] yup, we need those :) [15:06:54] FYI, we replaced the link between asw2-a6 and asw2-a7, everything looks good but let us know if there is any sign of issues [15:08:01] hm [15:08:17] seems like analytics is often suffering lots of TCP retransmits? https://grafana.wikimedia.org/d/000000562/network-errors-by-cluster?panelId=2&fullscreen&orgId=1&from=now-3h&to=now [15:08:33] cdanis: yes I was saying that yesterday [15:16:27] ehm we are working on it https://phabricator.wikimedia.org/T225296 [15:16:35] I blame Hadoop/Java settings [15:16:38] :) [15:16:40] marostegui: you about and can you join #wikimedia-dcops? [15:17:07] cdanis: https://grafana.wikimedia.org/d/000000366/network-performances-global?panelId=18&fullscreen&orgId=1&from=now-3h&to=now [15:17:18] esams:miscs is more worrying right now :) [15:17:36] heh wow [15:17:41] in % analytics is not *that* bad [15:17:57] that's a known issue I think, the esams:misc tcp retrans [15:18:32] recdns is one of the few "misc" there, and our dns monitoring (from pybal I think?) opens weird connections that always cause those stats, IIRC [15:19:26] XioNoX: thank you for the *that* term for your dear Analytics friends! [15:19:26] bblack: I think I filtered out the clusters with barely any traffic, so this might be legitimate [15:19:28] :P [15:19:55] elukey: haha 3% retransmits shouldn't happen in a DC [15:20:07] usually mean app issue or congestion [15:20:24] in esams I doubt we have either for the "misc" servers though [15:20:43] it's a stats scaling issue too [15:21:00] those servers don't really do much TCP to begin with, other than the borked monitoring traffic that causes retrans [15:21:09] so as a percentage it looks pretty awful [15:21:14] robh I'm done for the day, do you need me for something in particular? [15:21:42] elukey: wanna join #wikimedia-dcops? [15:21:52] marostegui: nope, you commented all was good from db perspective for a1 [15:22:03] i was just going to recheck since things change, etc... 
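The two properties bblack mentions at 14:05 (labels can carry any octet, but matching folds ASCII case) can be written down in a few lines. This is only an illustration of the RFC 1035 comparison rule, not code from any DNS server:

```python
def dns_names_equal(a: bytes, b: bytes) -> bool:
    # Any octet value is legal inside a label, but 'A'-'Z' compare equal
    # to 'a'-'z', which is what makes labels non-transparent for binary data.
    def fold(octet: int) -> int:
        return octet + 0x20 if 0x41 <= octet <= 0x5A else octet
    return len(a) == len(b) and all(fold(x) == fold(y) for x, y in zip(a, b))

print(dns_names_equal(b"WWW.Example", b"www.example"))  # True: ASCII case folds
print(dns_names_equal(b"\x01A", b"\x01a"))              # True: binary payloads that
                                                        # differ only in an ASCII
                                                        # alpha byte collide...
print(dns_names_equal(b"\xc3\x89", b"\xc3\xa9"))        # ...while UTF-8 E-acute
                                                        # upper/lower do not, hence
                                                        # the xn-- IDN scheme
```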
[15:22:28] bblack: yeah, my query excludes everything under `cluster:netstat_Tcp_RetransSegs:rate5m > 10` [15:22:52] robh yep, A1 is good [15:23:18] if I set it to 30 it's gone [15:24:25] XioNoX: so the query that detects unusual retrans by percentage, also excludes hosts which have unusually-high percentages? [15:24:26] robh if possible triple check labsdb1009 before removing it cables [15:24:28] I don't follow! :) [15:25:16] bblack: excludes hosts that have a low rate of retransmits, to avoid the usecase we're discussing [15:25:32] full query is `(cluster:netstat_Tcp_RetransSegs:rate5m > 10 ) / cluster:netstat_Tcp_OutSegs:rate5m > 0.01` [15:25:52] marostegui: we're checking cables at every step and reseating them all =] [15:25:54] so yeah [15:26:05] thanks :) [15:26:14] ok [15:26:17] so the two conditions to show up is to have more than 1% retransmits and more than 10 retransmits/m [15:26:27] /s [15:26:28] anyways, esams:misc should just be 1x bastion, 2x recdns, 1x authdns, and maybe the "spare" host [15:26:48] it's either TCP DNS monitoring causing that stat, or the bastion? [15:27:41] bblack: the culprit here is multatuli, with more than 30% TCP retransmits [15:28:21] https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=4&fullscreen&from=now-3h&to=now [15:29:33] ok, so authdns monitoring :) [15:30:06] XioNoX: note multatuli still has SACKs disabled (we skipped it with the dnsauth reboots as the serial console was broken and a racadm racreset didn't help either) [15:30:24] ah maybe it's related [15:30:46] not a big deal (the retransmits) but I was curious on why only that one host is showing up [15:31:20] no history data? [15:31:24] https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&panelId=4&fullscreen&from=now-7d&to=now [15:31:38] or does that mean real zeros where it's missing? [15:34:09] bblack: where it's missing was when it was a very low rate (<10p/s) I removed the condition so it shows all the data now [15:34:14] (refresh) [15:34:38] it's external, a particular peer network that I guess has bad filtering [15:35:10] "Ip Core Corporation" in Japan [15:35:21] we're sending them repeated SYNACK retrans, they never complete the handshake [15:35:36] (from several IPs, who knows, maybe something lame, but not impactful anyways) [15:35:52] ok, broken middlebox or something like that [15:35:58] thanks for having a look! [15:36:01] I assume we know for sure that we're not filtering TCP DNS at our esams borders heh [15:38:00] https://grafana.wikimedia.org/d/000000341/dns?panelId=10&fullscreen&orgId=1&from=now-7d&to=now [15:38:27] ^ all the 3x authdns (esams, eqiad, codfw) show a similar TCP pattern, big increase in TCP conns ~24h ago, trending down now [15:38:46] probably driven by some recursors out there in the world making some changes, and probably not an issue [15:41:26] labstore1007 has crazy retransmits since yesterday too - https://grafana.wikimedia.org/d/SxmTH3IZk/arzhels-playground?orgId=1&from=now-7d&to=now&panelId=2&fullscreen [15:48:39] labstore1007 is probably related to the reboots of the 1006/1007 pair [17:17:26] volans: oh, nice. yep. that confirms it's analytics-refinery that pulls it in. thx [17:17:42] yw :) [17:28:54] this co-working space i am at has a lot of IT books.. 
among them "Access 97", "Word 2003" and "CGI Programming with Perl" [17:42:22] * bd808 looks to see if he has "CGI Programming with Perl" on the shelf too [17:43:05] omg [17:43:09] loads of other perl books but not that one apparently [17:44:30] my copy of the camel book was very obviously heavily used [17:51:33] I don't have that but I do have the C book from 1978 with some pretty yellowed pages [17:51:39] looks like the 25th print run [17:51:48] anyways there's a delightfully useless form i the back of the book [17:52:29] tear out this page to order these other high quality c language and unix(tm) system titles in the prentice hall software series [17:53:25] if payment accompanies order, plus your state's sales tax where applicable, prentice hall pays postage and handling charges. [17:53:32] please do not mail in cash. [17:53:34] :-D [17:54:25] haha, 1978 beats this :) [17:55:56] first run was 1978, this run might have been a year or two later, hard to know [17:56:22] pretty sure they won't send me the book for a 15 day trial though like it says I can check on the form :-D [18:21:05] i have a 1975 "Computer Programming in the BASIC Language" stolen from my highschool. (i think they were using it 'til about '94.) [18:21:10] somewhat less hip than the c book. [18:22:05] i wonder why we ever installed the Apache PHP5 module on Icinga anyways.. not like it's PHP [18:22:19] going away now with "remove support for jessie" [18:31:07] volans: is there a cumin flag to get it to print all output from command execution together, and not print any of the (I guess I'd say) metadata about which servers outputted what and such? [19:02:38] cdanis: sorry cannit parse, what do you need exactly? [19:04:40] if you run with -f/--force and redirect stderr to devnull you get less stuff, but if you want to do post-processing there are ways [19:04:50] that depends on what you need ;) [19:06:24] also see https://gerrit.wikimedia.org/r/c/operations/software/netbox-deploy/+/525165 ? [19:10:59] a basic book from 1975 is pretty dang hip, don't sell it short [19:11:40] brennen: BASIC was also my first language. Commodore! [19:14:01] volans: when I'm doing evil things and using cumin as an ad hoc logs analysis pipeline, I don't want any grouping or anything [19:14:38] ok the you want to look at the wikitech page for cumin, handle output section [19:14:54] there is a txt and json output format [19:15:06] one output per host [19:15:30] with an ugly awk to get onky that (will go away, sorry) [19:16:56] I don't recall if hostnames are sorted and on mobile right now [19:17:14] cdanis: ^^^ [19:17:35] ahhh tw [19:17:38] ty [19:17:42] the ugly awk is not bad, I've written much uglier ;) [19:18:17] * volans hopes that's good enough for your usecase [19:18:30] but feel free to request features ;) [19:19:27] indeed, that's perfectly sufficient [19:19:32] thanks! [19:19:59] great, than heading for dinner :D
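For the "cumin as an ad-hoc log analysis pipeline" use case above, a throwaway post-processor over the per-host JSON output mentioned at 19:15 avoids the ugly awk. The JSON shape used here (a plain hostname-to-output mapping, one output per host) is an assumption for illustration; the "Handle output" section of the cumin wikitech page is the authoritative reference:

```python
#!/usr/bin/env python3
"""Flatten cumin per-host JSON output into bare lines (sketch)."""
import json
import sys

def main() -> None:
    results = json.load(sys.stdin)      # assumed: {"host1001...": "line1\nline2", ...}
    for host in sorted(results):
        for line in results[host].splitlines():
            # Drop the per-host grouping entirely: emit bare lines, ready to
            # pipe into sort/uniq/awk for ad-hoc analysis.
            print(line)

if __name__ == "__main__":
    main()
```

Piped into sort | uniq -c (or similar), this gives the "all output together, no per-host metadata" view without any manual cleanup.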