[08:10:57] elukey: sweet, thanks for the heads up! I'll take a look
[08:19:12] <3
[09:33:33] volans: i hear you have Opinions about python :)
[09:36:17] kormat: lol, I tend to do that, yes :-P
[09:36:46] i've found some disgusting bash that marostegui wrote, and it desperately needs pythonizing.
[09:36:54] hahaha
[09:38:05] without context sounds a great plan :)
[09:38:16] volans: with context it sounds even better, believe me.
[09:38:44] so, i'm wondering if you could point me to some sample code we already have, so i can get a feel for the style/approach that we're using here
[09:39:23] py2+eval for everything, obviously
[09:39:40] sure, actually I'm wondering if it might be useful to have an onboarding chat about it
[09:40:31] that sounds like a really good idea from my pov
[09:41:09] jayme, you might be interested? ^^
[09:41:56] things like coding style, logging, error handling, type checking (mypy), packaging
[09:42:08] and in particular deploying ;)
[09:42:17] trust me, that's the messy part
[09:42:18] volans: of course! Thanks for the ping
[09:42:33] kormat: can that pythonification wait for tomorrow?
[09:42:50] i don't know, marostegui is a hard taskmaster
[09:43:37] It can definitely wait
[09:43:47] ahh. he's nice in _public_. now i understand.
[09:43:55] hahahaha
[09:43:58] lol
[09:44:17] * kormat makes a note
[09:44:29] Let's have a chat in private, ok?
[09:44:31] XD
[09:44:56] i'd prefer to use publicly logged channels. *cough*
[09:45:35] * marostegui starts a videocall with kormat
[09:45:42] * kormat is never heard from again
[09:46:48] livestream it on youtube :-P
[09:47:30] like gitlab did
[09:48:00] you got my quote ;)
[10:07:01] kormat, I'd add unit and integration testing to the list of things to cover in that Python discussion, if there's time
[10:40:58] <_joe_> kormat> py2+eval for everything, obviously
[10:41:02] * _joe_ giggles
[10:44:07] <_joe_> pybal (master=)$ git grep -F 'eval(' | wc -l
[10:44:08] <_joe_> 7
[10:44:12] <_joe_> kormat: ^^
[10:46:15] <_joe_> don't worry, that's just our loadbalancer. But at least now it doesn't eval() things it downloads via http from another server.
[10:53:11] liw: excellent point
[10:53:30] _joe_: people keep telling me to not worry here. it's getting worrying. ;)
[10:54:11] kormat: don't worry, it won't get worrying
[10:54:18] hahah
[10:57:50] don't worry, it is worrying all the time!
[10:58:28] :-D
[11:00:23] if it wasn't worrying I'd be worried
[11:06:23] <_joe_> kormat: the silver lining is: it's got better over the years
[13:47:52] kormat: if you haven't already had the python discussion, please attempt to bait volans into discussing 'wmfpylib'
[14:52:00] o/
[14:54:00] chaomodus: o/ hope you enjoyed your time off :) not too much from clinic duty, some open access requests that require help from others (e.g. T250189)
[14:54:01] T250189: LDAP access to the wmf group for Sam Walton - https://phabricator.wikimedia.org/T250189
[14:54:28] cool i shall begin herding
[15:34:03] <_joe_> cdanis: weren't we gonna create it without volans knowing?
[15:34:10] that's a good idea
[15:34:17] <_joe_> else there will be 1 year of analysis paralysis
[15:35:02] <_joe_> then 6 months of nitpicks, add 3 months for fixing all the lint issues that all the linters he added are now reporting as incorrect...
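As context for the eval() exchange above: the risk being joked about is evaluating data fetched from another server as Python code. A minimal sketch of the usual safer pattern follows; it is not pybal's actual code, and the URL and expected data shape are purely illustrative.

    # Sketch, assuming a config blob fetched over HTTP that is formatted as a
    # Python literal. ast.literal_eval() accepts only literals (dicts, lists,
    # strings, numbers, ...), so a payload like "__import__('os').system(...)"
    # raises ValueError instead of executing, unlike eval().
    import ast
    import urllib.request

    CONFIG_URL = "http://config-master.example.org/pools/appservers"  # hypothetical URL

    def fetch_pool_config(url: str = CONFIG_URL) -> dict:
        with urllib.request.urlopen(url, timeout=5) as resp:
            raw = resp.read().decode("utf-8")
        data = ast.literal_eval(raw)  # literals only, no code execution
        if not isinstance(data, dict):
            raise ValueError("expected a dict of host -> settings")
        return data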
[17:31:07] XioNoX: I don't know where else to look, and I suspect provider/IX problems or something, but RIPE Atlas connectivity to esams has been quite bad today
[17:31:10] https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-target_site=eqsin&var-target_site=esams&var-ip_version=All&var-country_code=All&var-asn=All
[18:41:38] My internet is down and 4G decided to downgrade to HSDPA
[18:41:54] cdanis: I'm hearing things about LibertyGlobal having issues
[18:42:11] someone shared http://bix.hu/statisztika/liberty_global/d5f78a7e4ebbab8166f23d8a15e42263
[18:42:21] whoa
[18:42:53] in other news, I think we're in the middle of a medium mediawiki outage
[18:43:08] certainly a latency spike
[18:43:11] we should encapsulate your brain into an icinga alert
[18:44:05] XioNoX: lol, those graphs are in local time
[18:44:26] and Budapest is +2 so yeah, that lines up with the RIPE Atlas stats
[18:45:02] cdanis: people say it's global too, multiple countries
[18:46:31] I'm still kind of alarmed by the bigger IPv4 dropout from 05:20-06:50
[18:47:26] quite long, and quite large (under 96% of RIPE Atlas could reach esams over IPv4, usual values are north of 2.5 nines)
[18:49:25] doesn't show up on netflow
[18:49:51] yeah, and I couldn't find it on traffic graphs either
[18:50:01] makes me think that it's from origins that aren't geodns'd to esams
[18:51:00] ah yeah possible
[18:51:19] latency noticed in #wikimedia-tech
[18:51:29] feature request for the rope atlas probe exporter :)
[18:51:34] ripe*
[18:51:38] eheh
[18:51:46] we do have per-probe data in prometheus, XioNoX
[18:53:02] cdanis: could be useful to have a visualisation of probes/as/countries that are under a certain threshold
[18:53:18] not sure how and if it's doable though
[18:54:20] i'd like to add a 'continent' annotation to each country
[18:54:23] so you can filter on that too
[18:54:36] or an 'expected site' for each country, based on our geodns mapping
[18:55:27] yeah that was the feature request I mentioned above ^
[18:55:32] :)
[18:56:08] i think there's a map kind of panel for grafana
[18:58:04] ah, wasn't sure what you meant
[18:59:24] chaomodus: https://phabricator.wikimedia.org/T167689#5780784
[18:59:43] XioNoX: *spidermans meme*
[19:09:52] chaomodus: https://phabricator.wikimedia.org/T251184
[19:14:00] cdanis: now that I'm in this rabbit hole, this visualisation might be useful too: https://grafana.com/api/plugins/raintank-worldping-app/versions/1.2.7/images/img/wP-Screenshot-dash-summary.png same, not sure if possible out of the box
[19:15:41] XioNoX: https://w.wiki/PFE
[19:16:10] AT,CZ,DE
[19:16:20] nice!
[19:17:59] very possibly one ISP with that geography
[19:18:03] or something
[19:19:00] I think it's possible to do horrible things with Grafana https://grafana.com/grafana/plugins/jdbranham-diagram-panel
[19:21:45] it certainly is
[19:21:49] you're just writing Node code, I think
[19:22:12] sorry, not Node; Angular and/or React
[19:22:18] got my JS horrors confused
[19:23:14] <_joe_> cdanis: did you make anything out of the mw slowness?
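A rough sketch of the "probes/AS/countries under a certain threshold" idea discussed above, as a query against the Prometheus HTTP API. The metric name (ripe_atlas_probe_reachable) and the exact label set are hypothetical stand-ins; the real exporter behind the linked Grafana dashboard may name things differently.

    # Sketch, assuming per-probe reachability is exported as a 0/1 gauge with
    # country_code / target_site / ip_version labels. The endpoint URL is a
    # placeholder. The PromQL averages probes per country and keeps countries
    # below the threshold.
    import requests

    PROM_URL = "http://prometheus.example.org/ops/api/v1/query"  # placeholder
    THRESHOLD = 0.96  # fraction of probes that can reach the site

    QUERY = (
        'avg by (country_code) ('
        '  ripe_atlas_probe_reachable{target_site="esams", ip_version="4"}'
        ') < %.2f' % THRESHOLD
    )

    def countries_below_threshold() -> dict:
        resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        # Map each under-threshold country to its current reachability fraction.
        return {r["metric"]["country_code"]: float(r["value"][1]) for r in results}

    if __name__ == "__main__":
        for country, frac in sorted(countries_below_threshold().items()):
            print(f"{country}: {frac:.3f}")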
[19:23:38] _joe_: no, aside from "not databases or obviously memcached related, and also seemed unlikely to be the swat that was done"
[19:23:54] I didn't dig into investigating if it was a new scraper or bot, but that would be my next guess
[19:23:54] cdanis: just got back from lunch, sorry to miss that :(
[19:24:08] <_joe_> yeah was my best guess as well
[19:24:11] https://grafana.wikimedia.org/d/ifM0GzjWk/cdanis-xxx-php-worker-threads?orgId=1&from=now-12h&to=now
[19:24:16] the worker thread situation is worrying
[19:26:40] <_joe_> not really, we still have 1k to spare even in the worst moments
[19:26:54] <_joe_> but yeah I'd look at apache logs for slow requests
[19:27:10] we don't know that _joe_
[19:27:49] and the overall # free doesn't matter for tail latency; what matters for tail latency is that many appservers have some small reserve
[19:27:59] and per the last two graphs we're violating that pretty regularly
[19:28:51] also, the stats we have on worker threads are very limited, both in that you need a free worker thread to sample the statistic itself, and in that it's just whenever prometheus happens to scrape you, so there are saturation events you can miss
[19:28:51] <_joe_> yeah I'm having a hard time finding one appserver with zero available workers though, lemme look at the cluster metrics
[19:30:11] you should be able to get a set of times that a given server had 0 workers available from the tooltips on the graph of https://w.wiki/PFK
[19:32:32] <_joe_> this makes no sense
[19:32:41] <_joe_> all servers have normal cpu usage
[19:35:20] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1273
[19:35:28] <_joe_> this makes absolutely no sense to me
[19:35:37] <_joe_> let's go look at the slowlog on that server
[19:36:10] <_joe_> uhm, can someone check which servers are in the same rack as mw1273?
[19:36:21] looking
[19:36:49] mw1267 through mw1283
[19:36:52] _joe_: ^
[19:37:04] https://netbox.wikimedia.org/dcim/racks/7/
[19:37:38] <_joe_> so the slowlog is all queries
[19:38:02] <_joe_> this means, most requests taking more than 15 seconds were in the middle of a sql query
[19:39:36] are you suspecting access switch saturation?
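The "which servers are in the same rack as mw1273" lookup above was answered from Netbox. A minimal sketch of doing the same lookup with the pynetbox client, assuming a read token in the environment; this is illustrative, not necessarily how the answer was obtained in the chat.

    # Sketch: ask Netbox which devices share a rack with a given host.
    # NETBOX_TOKEN and the single-device assumption are illustrative.
    import os
    import pynetbox

    nb = pynetbox.api("https://netbox.wikimedia.org", token=os.environ["NETBOX_TOKEN"])

    def rack_neighbours(hostname: str = "mw1273") -> list:
        device = nb.dcim.devices.get(name=hostname)
        if device is None or device.rack is None:
            raise LookupError(f"{hostname} not found in Netbox or not racked")
        # All devices assigned to the same rack (mw1267 through mw1283 above).
        return sorted(d.name for d in nb.dcim.devices.filter(rack_id=device.rack.id))

    if __name__ == "__main__":
        print("\n".join(rack_neighbours()))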
[19:39:55] <_joe_> It's one possibility, yes
[19:40:08] <_joe_> so looking at https://config-master.wikimedia.org/pybal/eqiad/appservers-https
[19:40:24] <_joe_> two servers out of that group are at reduced weight
[19:41:37] <_joe_> now look at mw1263, another rack, weight 30
[19:41:38] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1263
[19:42:12] <_joe_> look at the first graph, workers saturation
[19:42:20] <_joe_> anything above 100 is problematic :)
[19:42:24] <_joe_> https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1268
[19:42:36] <_joe_> this is mw1268, same hardware as mw1263
[19:42:41] <_joe_> same weight in the LB
[19:44:17] <_joe_> let's hope that's not it though,
[19:49:44] there's no drops or errors shown on mw1263's network stats
[19:50:15] I can't make too much of the discard stats shown on https://librenms.wikimedia.org/device/device=160/tab=ports/view=minigraphs/graph=errors/ except that there are more discards towards the crs than I expected
[19:53:08] there are librenms alerts for the switches having too high a utilization on an individual port
[19:53:32] which i think should also cover the 'vcp' ports that make up the virtual chassis
[19:54:08] speaking of, there's been a handful of errors on this port since two weeks ago, but it doesn't look like much https://librenms.wikimedia.org/device/device=160/tab=port/port=18711/
[20:03:35] okay, I've failed to find evidence of access switch saturation
[20:03:45] the latency spikes are not as bad as they were
[20:16:21] today is the worst sustained appserver latency in a month and I have no idea why, sigh
[20:21:47] cdanis: I think those drops are because the uplinks are 4x10G and LACP balance them per flow, so some flows can briefly saturate the physical uplinks
[20:22:08] makes sense
[20:22:16] the error link is no factor year, we have an alert if they trigger
[20:22:28] s/year/yeah
[20:22:31] yeah didn't think so either, just happened to notice it
[20:22:42] the per-flow balancing is done to avoid needing to worry about out-of-order delivery, which TCP interprets as possible packet loss?
[20:22:58] haha I was like "there are errors I'm not aware of?!"
[20:23:05] lol yeah like 5
[20:23:44] cdanis: yeah exactly about the out of order
[20:24:46] fix is to upgrade those to 1x40G links or more, which we will be able to with new linecards
[20:29:15] cdanis: I didn't follow the whole story, but for example mw1263 has under 3 TCP retransmits/s out of 30k segments/s https://grafana.wikimedia.org/d/000000365/network-performances?orgId=1&var-server=mw1263&var-datasource=eqiad%20prometheus%2Fops
[20:29:23] ah! forgot about TCP retransmits
[20:29:38] I think we're thinking the loss would be on services talking towards the mws though
[20:31:38] cdanis: no significant increase https://grafana.wikimedia.org/d/000000366/network-performances-global?orgId=1&from=now-12h&to=now
[20:31:42] yeah
[20:31:57] what's up with the esams cache-text out dest-unreach?
[20:34:06] cdanis: good question, I dived into that a while ago but forgot what they were from since
[20:34:18] I should check that dashboard more regularly
[20:47:37] cdanis: culprit is maybe 20:42:47.928961 IP cp3064.esams.wmnet > recdns.anycast.wmnet: ICMP cp3064.esams.wmnet udp port 33145 unreachable, length 132 so maybe one side closing the socket too soon?
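To illustrate the per-flow LACP point made above: each flow's 5-tuple is hashed to pick one member link of the 4x10G bundle, which preserves packet order for TCP but means a single large flow can saturate one 10G member even while the aggregate has headroom. The hash below is a toy stand-in for whatever the switch ASIC actually uses, and the interface names are examples.

    # Toy illustration of per-flow LACP load balancing: a flow's 5-tuple always
    # hashes to the same member link (no reordering for TCP), but one elephant
    # flow can overload a single 10G member even if the 4x10G bundle is mostly idle.
    import hashlib

    MEMBER_LINKS = ["xe-0/1/0", "xe-0/1/1", "xe-0/1/2", "xe-0/1/3"]  # example 4x10G uplinks

    def pick_member(src_ip: str, dst_ip: str, proto: int, sport: int, dport: int) -> str:
        key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
        digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
        return MEMBER_LINKS[digest % len(MEMBER_LINKS)]

    # Every packet of this flow takes the same member link.
    print(pick_member("10.64.0.1", "10.64.48.20", 6, 45678, 443))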
[20:48:15] anyway, I'll look more at it tomorrow
[20:49:18] whoa
[20:49:37] uh getting unreachables on recdns seems bad
[20:50:54] they don't match the graph frequency though
[20:52:05] hm okay
[20:52:12] I'm quite tired, I'm going to stop looking for now
[20:53:07] it's only cp3064 too
[20:55:52] AH!
[20:55:57] -i any
[20:56:14] 20:55:49.147010 IP text-lb.esams.wikimedia.org > text-lb.esams.wikimedia.org: ICMP XXXX unreachable - need to frag (mtu 576), length 556
[20:56:27] XXXX is some client IP
[21:07:04] so PMTU to that specific client IP is 576, `cp3064:~$ ip route get to 194.44.x.x`
[21:10:19] but for some reason cp3064 tries to send over and over a 578 byte IP packet (with an HTTPS payload, I think the handshake's ACK)
[21:10:43] probably can't go under because of all the cipher lists?
[21:11:46] So as it's too big, cp3064 sends an ICMP packet too big (aka frag needed) to itself (cp3064)
[21:12:26] of course the "don't fragment" bit is set on the initial packet
[21:21:27] So it's a broken client, but I'm wondering if ATS could/should fragment it (or not set the don't fragment bit)
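For reference, a small sketch of the PMTU check done above on cp3064 with `ip route get`: it shells out to the same command and extracts the cached `mtu` value for a destination, returning None when the kernel has no per-route PMTU entry. The destination address here is a placeholder, since the real client IP is redacted in the chat.

    # Sketch of the PMTU check from the chat: run `ip route get <dst>` and pull
    # out the cached path MTU, if the kernel has one for that destination.
    import re
    import subprocess

    def cached_pmtu(dst: str):
        out = subprocess.run(
            ["ip", "route", "get", dst],
            capture_output=True, text=True, check=True,
        ).stdout
        # A cached PMTU shows up as e.g. "... cache expires 599sec mtu 576".
        match = re.search(r"\bmtu (\d+)", out)
        return int(match.group(1)) if match else None

    print(cached_pmtu("192.0.2.10"))  # placeholder destination (TEST-NET-1)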