[08:10:57] elukey: sweet, thanks for the heads up! I'll take a look
[08:19:12] <3
[09:33:33] volans: i hear you have Opinions about python :)
[09:36:17] kormat: lol, I tend to do that, yes :-P
[09:36:46] i've found some disgusting bash that marostegui wrote, and it desperately needs pythonizing.
[09:36:54] hahaha
[09:38:05] without context sounds a great plan :)
[09:38:16] volans: with context it sounds even better, believe me.
[09:38:44] so, i'm wondering if you could point me to some sample code we already have, so i can get a feel for the style/approach that we're using here
[09:39:23] py2+eval for everything, obviously
[09:39:40] sure, actually I'm wondering if it might be useful to have an onboarding chat about it
[09:40:31] that sounds like a really good idea from my pov
[09:41:09] jayme, you might be interested? ^^
[09:41:56] things like coding style, logging, error handling, type checking (mypy), packaging
[09:42:08] and in particular deploying ;)
[09:42:17] trust me, that's the messy part
[09:42:18] volans: of course! Thanks for the ping
[09:42:33] kormat: can that pythonification wait for tomorrow?
[09:42:50] i don't know, marostegui is a hard taskmaster
[09:43:37] It can definitely wait
[09:43:47] ahh. he's nice in _public_. now i understand.
[09:43:55] hahahaha
[09:43:58] lol
[09:44:17] * kormat makes a note
[09:44:29] Let's have a chat in private, ok?
[09:44:31] XD
[09:44:56] i'd prefer to use publicly logged channels. *cough*
[09:45:35] * marostegui starts a videocall with kormat
[09:45:42] * kormat is never heard from again
[09:46:48] livestream it on youtube :-P
[09:47:30] like gitlab did
[09:48:00] you got my quote ;)
[10:07:01] kormat, I'd add unit and integration testing to the list of things to cover in that Python discussion, if there's time
[10:40:58] <_joe_> kormat> py2+eval for everything, obviously
[10:41:02] * _joe_ giggles
[10:44:07] <_joe_> pybal (master=)$ git grep -F 'eval(' | wc -l
[10:44:08] <_joe_> 7
[10:44:12] <_joe_> kormat: ^^
[10:46:15] <_joe_> don't worry, that's just our loadbalancer. But at least now it doesn't eval() things it downloads via http from another server.
[10:53:11] liw: excellent point
[10:53:30] _joe_: people keep telling me to not worry here. it's getting worrying. ;)
[10:54:11] kormat: don't worry, it won't get worrying
[10:54:18] hahah
[10:57:50] don't worry, it is worrying all the time!
[10:58:28] :-D
[11:00:23] if it wasn't worrying I'd be worried
[11:06:23] <_joe_> kormat: the silver lining is: it's got better over the years
[13:47:52] kormat: if you haven't already had the python discussion, please attempt to bait volans into discussing 'wmfpylib'
[14:52:00] o/
[14:54:00] chaomodus: o/ hope you enjoyed your time off :) not too much from clinic duty, some open access requests that require help from others (e.g. T250189)
[14:54:01] T250189: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189
[14:54:28] cool i shall begin herding
[15:34:03] <_joe_> cdanis: weren't we gonna create it without volans knowing?
[15:34:10] that's a good idea
[15:34:17] <_joe_> else there will be 1 year of analysis paralysis
[15:35:02] <_joe_> then 6 months of nitpicks, add 3 months for fixing all the lint issues that all the linters he added are now reporting as incorrect...
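As context for the eval() exchange above: the risk being joked about is evaluating data fetched from another server as Python code. A minimal sketch of the usual safer pattern follows; it is not pybal's actual code, and the URL and expected data shape are purely illustrative.

    # Sketch, assuming a config blob fetched over HTTP that is formatted as a
    # Python literal. ast.literal_eval() accepts only literals (dicts, lists,
    # strings, numbers, ...), so a payload like "__import__('os').system(...)"
    # raises ValueError instead of executing, unlike eval().
    import ast
    import urllib.request

    CONFIG_URL = "http://config-master.example.org/pools/appservers"  # hypothetical URL

    def fetch_pool_config(url: str = CONFIG_URL) -> dict:
        with urllib.request.urlopen(url, timeout=5) as resp:
            raw = resp.read().decode("utf-8")
        data = ast.literal_eval(raw)  # literals only, no code execution
        if not isinstance(data, dict):
            raise ValueError("expected a dict of host -> settings")
        return data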
[17:31:07] XioNoX: I don't know where else to look, and I suspect provider/IX problems or something, but RIPE Atlas connectivity to esams has been quite bad today
[17:31:10] https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-target_site=eqsin&var-target_site=esams&var-ip_version=All&var-country_code=All&var-asn=All
[18:41:38] My internet is down and 4G decided to downgrade to HSDPA
[18:41:54] cdanis: I'm hearing things about LibertyGlobal having issues
[18:42:11] someone shared http://bix.hu/statisztika/liberty_global/d5f78a7e4ebbab8166f23d8a15e42263
[18:42:21] whoa
[18:42:53] in other news, I think we're in the middle of a medium mediawiki outage
[18:43:08] certainly a latency spike
[18:43:11] we should encapsulate your brain into an icinga alert
[18:44:05] XioNoX: lol, those graphs are in local time
[18:44:26] and Budapest is +2 so yeah, that lines up with the RIPE Atlas stats
[18:45:02] cdanis: people say it's global too, multiple countries
[18:46:31] I'm still kind of alarmed by the bigger IPv4 dropout from 05:20-06:50
[18:47:26] quite long, and quite large (under 96% of RIPE Atlas could reach esams over IPv4, usual values are north of 2.5 nines)
[18:49:25] doesn't show up on netflow
[18:49:51] yeah, and I couldn't find it on traffic graphs either
[18:50:01] makes me think that it's from origins that aren't geodns'd to esams
[18:51:00] ah yeah possible
[18:51:19] latency noticed in #wikimedia-tech
[18:51:29] feature request for the rope atlas probe exporter :)
[18:51:34] ripe*
[18:51:38] eheh
[18:51:46] we do have per-probe data in prometheus, XioNoX
[18:53:02] cdanis: could be useful to have a visualisation of probes/as/countries that are under a certain threshold
[18:53:18] not sure how and if it's doable though
[18:54:20] i'd like to add a 'continent' annotation to each country
[18:54:23] so you can filter on that too
[18:54:36] or an 'expected site' for each country, based on our geodns mapping
[18:55:27] yeah that was the feature request I mentioned above ^
[18:55:32] :)
[18:56:08] i think there's a map kind of panel for grafana
[18:58:04] ah, wasn't sure what you meant
[18:59:24] chaomodus: https://phabricator.wikimedia.org/T167689#5780784
[18:59:43] XioNoX: *spidermans meme*
[19:09:52] chaomodus: https://phabricator.wikimedia.org/T251184
[19:14:00] cdanis: now that I'm in this rabbit hole, this visualisation might be useful too: https://grafana.com/api/plugins/raintank-worldping-app/versions/1.2.7/images/img/wP-Screenshot-dash-summary.png same, not sure if possible out of the box
[19:15:41] XioNoX: https://w.wiki/PFE
[19:16:10] AT,CZ,DE
[19:16:20] nice!
[19:17:59] very possibly one ISP with that geography
[19:18:03] or something
[19:19:00] I think it's possible to do horrible things with Grafana https://grafana.com/grafana/plugins/jdbranham-diagram-panel
[19:21:45] it certainly is
[19:21:49] you're just writing Node code, I think
[19:22:12] sorry, not Node; Angular and/or React
[19:22:18] got my JS horrors confused
[19:23:14] <_joe_> cdanis: did you make anything out of the mw slowness?
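A rough sketch of the "probes/AS/countries under a certain threshold" idea discussed above, as a query against the Prometheus HTTP API. The metric name (ripe_atlas_probe_reachable) and the exact label set are hypothetical stand-ins; the real exporter behind the linked Grafana dashboard may name things differently.

    # Sketch, assuming per-probe reachability is exported as a 0/1 gauge with
    # country_code / target_site / ip_version labels. The endpoint URL is a
    # placeholder. The PromQL averages probes per country and keeps countries
    # below the threshold.
    import requests

    PROM_URL = "http://prometheus.example.org/ops/api/v1/query"  # placeholder
    THRESHOLD = 0.96  # fraction of probes that can reach the site

    QUERY = (
        'avg by (country_code) ('
        '  ripe_atlas_probe_reachable{target_site="esams", ip_version="4"}'
        ') < %.2f' % THRESHOLD
    )

    def countries_below_threshold() -> dict:
        resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        # Map each under-threshold country to its current reachability fraction.
        return {r["metric"]["country_code"]: float(r["value"][1]) for r in results}

    if __name__ == "__main__":
        for country, frac in sorted(countries_below_threshold().items()):
            print(f"{country}: {frac:.3f}")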
[19:23:38] _joe_: no, aside from "not databases or obviously memcached related, and also seemed unlikely to be the swat that was done"
[19:23:54] I didn't dig into investigating if it was a new scraper or bot, but that would be my next guess
[19:23:54] cdanis: just got back from lunch, sorry to miss that :(
[19:24:08] <_joe_> yeah was my best guess as well
[19:24:11] https://grafana.wikimedia.org/d/ifM0GzjWk/cdanis-xxx-php-worker-threads?orgId=1&from=now-12h&to=now
[19:24:16] the worker thread situation is worrying
[19:26:40] <_joe_> not really, we still have 1k to spare even in the worst moments
[19:26:54] <_joe_> but yeah I'd look at apache logs for slow requests
[19:27:10] we don't know that _joe_
[19:27:49] and the overall # free doesn't matter for tail latency; what matters for tail latency is that many appservers have some small reserve
[19:27:59] and per the last two graphs we're violating that pretty regularly
[19:28:51] also, the stats we have on worker threads are very limited, both in that you need a free worker thread to sample the statistic itself, and in that it's just whenever prometheus happens to scrape you, so there are saturation events you can miss
[19:28:51] <_joe_> yeah I'm having a hard time finding one appserver with zero available workers though, lemme look at the cluster metrics
[19:30:11] you should be able to get a set of times that a given server had 0 workers available from the tooltips on the graph of https://w.wiki/PFK
[19:32:32] <_joe_> this makes no sense
[19:32:41] <_joe_> all servers have normal cpu usage
[19:35:20] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1273
[19:35:28] <_joe_> this makes absolutely no sense to me
[19:35:37] <_joe_> let's go look at the slowlog on that server
[19:36:10] <_joe_> uhm, can someone check which servers are in the same rack as mw1273?
[19:36:21] looking
[19:36:49] mw1267 through mw1283
[19:36:52] _joe_: ^
[19:37:04] https://netbox.wikimedia.org/dcim/racks/7/
[19:37:38] <_joe_> so the slowlog is all queries
[19:38:02] <_joe_> this means, most requests taking more than 15 seconds were in the middle of a sql query
[19:39:36] are you suspecting access switch saturation?
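The "which servers are in the same rack as mw1273" lookup above was answered from Netbox. A minimal sketch of doing the same lookup with the pynetbox client, assuming a read token in the environment; this is illustrative, not necessarily how the answer was obtained in the chat.

    # Sketch: ask Netbox which devices share a rack with a given host.
    # NETBOX_TOKEN and the single-device assumption are illustrative.
    import os
    import pynetbox

    nb = pynetbox.api("https://netbox.wikimedia.org", token=os.environ["NETBOX_TOKEN"])

    def rack_neighbours(hostname: str = "mw1273") -> list:
        device = nb.dcim.devices.get(name=hostname)
        if device is None or device.rack is None:
            raise LookupError(f"{hostname} not found in Netbox or not racked")
        # All devices assigned to the same rack (mw1267 through mw1283 above).
        return sorted(d.name for d in nb.dcim.devices.filter(rack_id=device.rack.id))

    if __name__ == "__main__":
        print("\n".join(rack_neighbours()))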
[19:39:55] <_joe_> It's one possibility, yes
[19:40:08] <_joe_> so looking at https://config-master.wikimedia.org/pybal/eqiad/appservers-https
[19:40:24] <_joe_> two servers out of that group are at reduced weight
[19:41:37] <_joe_> now look at mw1263, another rack, weight 30
[19:41:38] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1263
[19:42:12] <_joe_> look at the first graph, workers saturation
[19:42:20] <_joe_> anything above 100 is problematic :)
[19:42:24] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1268
[19:42:36] <_joe_> this is mw1268, same hardware as mw1263
[19:42:41] <_joe_> same weight in the LB
[19:44:17] <_joe_> let's hope that's not it though,
[19:49:44] there's no drops or errors shown on mw1263's network stats
[19:50:15] I can't make too much of the discard stats shown on https://librenms.wikimedia.org/device/device=160/tab=ports/view=minigraphs/graph=errors/ except that there are more discards towards the crs than I expected
[19:53:08] there are librenms alerts for the switches having too high a utilization on an individual port
[19:53:32] which i think should also cover the 'vcp' ports that make up the virtual chassis
[19:54:08] speaking of, there's been a handful of errors on this port since two weeks ago, but it doesn't look like much https://librenms.wikimedia.org/device/device=160/tab=port/port=18711/
[20:03:35] okay, I've failed to find evidence of access switch saturation
[20:03:45] the latency spikes are not as bad as they were
[20:16:21] today is the worst sustained appserver latency in a month and I have no idea why, sigh
[20:21:47] cdanis: I think those drops are because the uplinks are 4x10G and LACP balance them per flow, so some flows can briefly saturate the physical uplinks
[20:22:08] makes sense
[20:22:16] the error link is no factor year, we have an alert if they trigger
[20:22:28] s/year/yeah
[20:22:31] yeah didn't think so either, just happened to notice it
[20:22:42] the per-flow balancing is done to avoid needing to worry about out-of-order delivery, which TCP interprets as possible packet loss?
[20:22:58] haha I was like "there are errors I'm not aware of?!"
[20:23:05] lol yeah like 5
[20:23:44] cdanis: yeah exactly about the out of order
[20:24:46] fix is to upgrade those to 1x40G links or more, which we will be able to with new linecards
[20:29:15] cdanis: I didn't follow the whole story, but for example mw1263 has under 3 TCP retransmits/s out of 30k segments/s https://grafana.wikimedia.org/d/000000365/network-performances?orgId=1&var-server=mw1263&var-datasource=eqiad%20prometheus%2Fops
[20:29:23] ah! forgot about TCP retransmits
[20:29:38] I think we're thinking the loss would be on services talking towards the mws though
[20:31:38] cdanis: no significant increase https://grafana.wikimedia.org/d/000000366/network-performances-global?orgId=1&from=now-12h&to=now
[20:31:42] yeah
[20:31:57] what's up with the esams cache-text out dest-unreach?
[20:34:06] cdanis: good question, I dived into that a while ago but forgot what they were from since
[20:34:18] I should check that dashboard more regularly
[20:47:37] cdanis: culprit is maybe 20:42:47.928961 IP cp3064.esams.wmnet > recdns.anycast.wmnet: ICMP cp3064.esams.wmnet udp port 33145 unreachable, length 132 so maybe one side closing the socket too soon?
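To illustrate the per-flow LACP point made above: each flow's 5-tuple is hashed to pick one member link of the 4x10G bundle, which preserves packet order for TCP but means a single large flow can saturate one 10G member even while the aggregate has headroom. The hash below is a toy stand-in for whatever the switch ASIC actually uses, and the interface names are examples.

    # Toy illustration of per-flow LACP load balancing: a flow's 5-tuple always
    # hashes to the same member link (no reordering for TCP), but one elephant
    # flow can overload a single 10G member even if the 4x10G bundle is mostly idle.
    import hashlib

    MEMBER_LINKS = ["xe-0/1/0", "xe-0/1/1", "xe-0/1/2", "xe-0/1/3"]  # example 4x10G uplinks

    def pick_member(src_ip: str, dst_ip: str, proto: int, sport: int, dport: int) -> str:
        key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
        digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
        return MEMBER_LINKS[digest % len(MEMBER_LINKS)]

    # Every packet of this flow takes the same member link.
    print(pick_member("10.64.0.1", "10.64.48.20", 6, 45678, 443))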
[20:48:15] anyway, I'll look more at it tomorrow
[20:49:18] whoa
[20:49:37] uh getting unreachables on recdns seems bad
[20:50:54] they don't match the graph frequency though
[20:52:05] hm okay
[20:52:12] I'm quite tired, I'm going to stop looking for now
[20:53:07] it's only cp3064 too
[20:55:52] AH!
[20:55:57] -i any
[20:56:14] 20:55:49.147010 IP text-lb.esams.wikimedia.org > text-lb.esams.wikimedia.org: ICMP XXXX unreachable - need to frag (mtu 576), length 556
[20:56:27] XXXX is some client IP
[21:07:04] so PMTU to that specific client IP is 576, `cp3064:~$ ip route get to 194.44.x.x`
[21:10:19] but for some reason cp3064 tries to send over and over a 578 byte IP packet (with an HTTPS payload, I think the handshake's ACK)
[21:10:43] probably can't go under because of all the cipher lists?
[21:11:46] So as it's too big, cp3064 sends an ICMP packet too big (aka frag needed) to itself (cp3064)
[21:12:26] of course the "don't fragment" bit is set on the initial packet
[21:21:27] So it's a broken client, but I'm wondering if ATS could/should fragment it (or not set the don't fragment bit)
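For reference, a small sketch of the PMTU check done above on cp3064 with `ip route get`: it shells out to the same command and extracts the cached `mtu` value for a destination, returning None when the kernel has no per-route PMTU entry. The destination address here is a placeholder, since the real client IP is redacted in the chat.

    # Sketch of the PMTU check from the chat: run `ip route get <dst>` and pull
    # out the cached path MTU, if the kernel has one for that destination.
    import re
    import subprocess

    def cached_pmtu(dst: str):
        out = subprocess.run(
            ["ip", "route", "get", dst],
            capture_output=True, text=True, check=True,
        ).stdout
        # A cached PMTU shows up as e.g. "... cache expires 599sec mtu 576".
        match = re.search(r"\bmtu (\d+)", out)
        return int(match.group(1)) if match else None

    print(cached_pmtu("192.0.2.10"))  # placeholder destination (TEST-NET-1)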