[08:02:13] hello people!
[08:02:37] cp3010 was frozen for some reason, so I depooled it and powercycled
[08:02:50] will wait for you guys before taking any more action
[08:37:28] elukey: frozen?
[08:37:33] elukey: you were unable to login?
[08:37:57] vgutierrez: unreachable from ssh + mgmt console
[08:38:31] didn't check exactly what happened in the lost, just depooled + rebooted basically
[08:38:42] *logs
[08:39:18] ack, thx
[09:06:17] I don't see anything weird on the machine logs, last event appears to be at 06:01:01, Mar 22 06:01:01 cp3010 CRON[20696]: (root) CMD (/usr/bin/logster -o statsd --statsd-host=statsd.eqiad.wmnet:8125 --metric-prefix=varnishkafka.cp3010.webrequest.misc JsonLogster /var/cache/varnishkafka/w
[09:06:22] ebrequest.stats.json > /dev/null 2>&1)
[09:07:44] brb
[09:29:26] good morning
[09:30:05] hey nothing on fire
[09:30:42] that means I can have breakfast today \o/
[09:32:30] ema: hahaha right
[09:32:55] cp3010 looks weird though, nothing on ipmi-sel nor the logs
[09:33:00] what am I missing?
[09:51:00] let's see
[09:53:10] so the server was rebooted ~ 2 hours ago
[09:53:27] yup
[09:53:31] Luca restarted it
[09:53:33] elukey: anything interesting in mgmt console?
[09:53:38] powercycled by elukey
[09:54:01] apparently he wasn't able to reach the instance over ssh or mgmt console
[09:54:07] together with 3007 and 3008, 3010 is running 5.1.3-1wm7 since yesterday
[09:54:37] ema: empty blank screen, unresponsive to any command
[09:54:41] I checked SEL, there were no failure events logged, just some stuff from 2012/2013
[09:54:53] yep that too --^
[09:55:01] (I tried racadm getsel)
[09:55:03] which includes the fix for https://github.com/varnishcache/varnish-cache/issues/1799 and the OH ref leak fix
[09:55:05] yup, same I've seen
[09:55:31] I don't think that a wild varnish is able to crash the mgmt interface though
[09:55:38] nothing interesting in kern.log
[09:55:46] Mar 21 06:25:06 cp3010 kernel: [1024227.296698] Process accounting resumed
[09:55:48] the server is six years old
[09:55:51] Mar 22 08:00:56 cp3010 kernel: [ 0.000000] Linux version 4.9.0-0.bpo.6-amd64 (debian-kernel@lists.debian.org) (gcc version 4.9.2 (Debian 4.9.2-10+deb8u1) ) #1 SMP Debian 4.9.82-1~wmf1 (2018-02-19)
[09:58:07] so was the mgmt interface also down? It is up now
[09:59:38] vgutierrez: unreachable from ssh + mgmt console
[09:59:48] apparently
[10:00:22] no sorry that was me not explaining it well
[10:00:46] ssh was unreachable, mgmt was reachable but no sign of a working tty
[10:01:00] ok
[10:09:42] I'm gonna repool the host, it looks all good to me now
[10:29:58] ema: btw, if you've a moment I need your wisdom O:)
[10:31:15] comparing varnishreqstats VS varnishreqstats.mtail, no rush though
[10:31:55] also if you're ok with it, I'm merging https://gerrit.wikimedia.org/r/#/c/415260/ (PyBal BGP icinga check) cause it's been hanging there a couple of weeks and there is no reason to delay it more
[10:35:35] ship it!
[10:35:40] <3
[10:46:29] we should keep a running log/ticket somewhere of traffic we see that looks "questionable" for some reason or other (as in, questions for application developers about why it looks so strange)
[10:47:30] my new addition for this morning from trawling problematic times in slowlog:
[10:47:35] https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Gray's_Inn_Lane_Hand_Axe/daily/20151001/20151031
[10:47:59] it's fast now and generally non-problematic I think, but what's interesting is the response includes this header:
[10:48:11] < content-location: http://aqs.svc.eqiad.wmnet:7232/analytics.wikimedia.org/v1/pageviews/per-article/en.wikipedia/all-access/user/Gray%27s_Inn_Lane_Hand_Axe/daily/20151001/20151031
[10:48:24] why are we sending internal-only URLs in header responses for public APIs?
[10:49:43] also, I've seen more small-gzip-response hints today. I may look more at disabling it from the varnish side (i.e. blocking AE:gzip from varnish->app)
[10:50:37] another interesting hint that I've gotten nowhere with yet:
[10:51:06] some slow responses to regular wiki page accesses have the following sort of pattern in their timings+headers:
[10:51:34] time-apache-delay 0.101334
[10:51:51] time-fetch 61.724494
[10:52:07] response-Age 61
[10:52:38] I've seen this more than a few times, where the inexplicable gap from apache-delay->fetch timing is (roughly to the second) the same as the Age value in the response...
[10:53:41] I'm not sure what to make of that. it may just be "normal" if the headers at least arrived quickly (varnish records age as zero on quick header reception, spends 61s fetching, then serves the object with Age=61?)
[10:54:58] interesting
[10:55:33] we can double-check if that is indeed varnish's behavior with vtc
[10:57:15] although if that's the case it sounds like a bug (age being the time an object has been in cache, if it hasn't been fetched yet how can it be in cache?)
[10:57:23] well
[10:58:02] I wouldn't say it's a bug, more a question of interpretation, and I think varnish's interpretation there is reasonable.
[10:59:21] it's looking at it as: "The applayer sent no Age header or Age:0, we record Age:0 immediately and assume a fresh object has been created. That it took 61s to finish moving bytes across the wire afterwards before we could then respond to the client is irrelevant. 61s has now passed since the object's birth and thus the age is now 61"
[11:00:10] of course all of the above only makes sense if we're store-and-forward, but we're defaulting to streaming...
[11:00:32] if we're streaming, it would stream through the Age:0, and spend 61s slowly streaming things through, right?
[11:00:58] so it seems like this can only happen when it's not a wire-speed-limit on slowly transferring content.
[11:02:09] ema: a question I don't have the answer to in my head: in normal default streaming mode, does varnish forward response headers to the client immediately if the body stalls?
[11:02:47] I know we get separate vcl callbacks that give us the chance to make decisions after response header reception and before fetching the body...
[11:03:39] so surely, it doesn't forward the headers until after the callback where we get to edit them and make decisions. I wonder if it immediately sends output headers afterwards though, or waits until at least some body bytes are ready to stream through...
[11:08:17] sigh.. I assumed the behaviour of check_prometheus (bash) and check_prometheus.py was the same.. and of course I was wrong
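For reference, a minimal VCL sketch of the "blocking AE:gzip from varnish->app" idea mentioned above. This is hypothetical and not the actual change under review; the subroutine names are stock Varnish 4/5, and the mime-type list is purely illustrative. The idea is that the backend-most varnish stops advertising gzip to the applayer and does any compression itself:

    # Backend-most layer only: stop advertising gzip to the applayer, so it
    # always responds with identity-encoded bodies.
    sub vcl_backend_fetch {
        unset bereq.http.Accept-Encoding;
    }

    # Let varnish itself gzip compressible types on the way into storage,
    # so objects are still cached and served compressed.
    sub vcl_backend_response {
        if (beresp.http.Content-Type ~ "^(text/|application/json|application/javascript)") {
            set beresp.do_gzip = true;
        }
    }

The obvious trade-off, discussed further down, is CPU: gzip work moves from the applayer to the backend-most varnish, which only matters on cache misses.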
[11:15:04] *check_prometheus_metric to be accurate
[11:17:54] bblack: age is 0 regardless of the time spent to fetch the body https://phabricator.wikimedia.org/P6881
[11:24:56] ema: so then the reasonable interpretation of these slowlogs is that the applayer was setting that Age value?
[11:25:21] if so it's still very odd....
[11:25:53] I'm assuming (perhaps incorrectly) that the applayer does headers then content separately
[11:26:07] it would seem impossible that it would predict its own body-send timing
[11:26:36] however!
[11:26:40] but it also seems unlikely that if the applayer spent 61s spinning internally before it ever generated the header output, it wouldn't be recording a low apache D= timing
[11:26:47] with beresp.do_stream=false in vcl_backend_response, Age=2
[11:26:55] hmmm
[11:29:16] bblack: re: content-location: http://aqs.svc.eqiad.wmnet:7232 - will ask my team, could be some setting that needs to be tuned on the AQS side
[11:29:53] elukey: well, I'd suspect this to be more of an RB-level thing and AQS just being an example
[11:30:14] elukey: and they may well say "we need to return this as some unique key and that's the right value and clients are never expected to deref it", I donno
[11:30:38] bblack: yep yep, but I'll open a task to investigate, it is weird
[11:32:10] full vtc here for future reference https://phabricator.wikimedia.org/P6882
[11:37:20] ema: https://gerrit.wikimedia.org/r/#/c/421267/ ?
[11:37:45] we've lived with the junky bad-gzip -> 503 for 500s thing for a long time anyways
[11:38:25] but I've seen a fair number of these CL:0 (or CL<100) responses that are application-gzipped in the slowlog investigations targeting problem times, too, so it's kind of a shot-in-the-dark at that.
[11:42:07] that would mean storing objects uncompressed on varnish-be at the backend-most layer, right?
[11:44:57] > If the server responds with gzip'ed content it will be stored in memory in its compressed form
[11:45:11] it shouldn't mean that, no
[11:45:33] because we also turn on do_gzip elsewhere for a wide range of compressible mime types
[11:45:43] (so varnish will compress it on the way in)
[11:45:50] ah, right
[11:46:04] the biggest risk is that there's a large jump in CPU% or something, from offloading applayer gzips to varnish
[11:46:31] but I don't think gzip costs enough to matter, on the be-miss side of things
[11:48:38] also https://gerrit.wikimedia.org/r/#/c/295372/
[11:48:51] ok yes, we do set do_gzip in wm_common_backend_response
[11:48:51] for the cases where varnish is already the one gzipping, we tuned it there
[11:49:28] if cpu% looks problematic, we could back down the settings or re-think (or finally go package and deploy the cloudflare zlib improvements, if they haven't already trickled through upstream)
[11:51:58] ship it!
[11:53:41] yeah, if you're not doing related things, I might puppet-disable the fleet and see how it looks on a single eqiad text node first (since it only really has cpu impact there)
[11:53:57] sure, go ahead
[11:55:09] I was on upload@esams reboots for kernel upgrades but don't mind a coffee break :)
[11:56:31] btw, I haven't been looking hard enough but probably should more... any differential for the numa_networking setting in recent troubles?
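A varnishtest sketch in the spirit of the pastes above (the real one is in P6882); the delay and chunk sizes are illustrative, and the point is only that with streaming the Age delivered to the client reflects header-arrival time rather than body-fetch time:

    varnishtest "Age vs. slow body fetch"

    server s1 {
        rxreq
        # headers right away, body stalls for a while
        txresp -nolen -hdr "Transfer-Encoding: chunked" -hdr "Content-Type: text/plain"
        delay 3
        chunkedlen 64
        chunkedlen 0
    } -start

    varnish v1 -vcl+backend {
        sub vcl_backend_response {
            set beresp.do_stream = true;  # with false, Age grows with the fetch time instead
        }
    } -start

    client c1 {
        txreq
        rxresp
        expect resp.http.Age == 0
    } -run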
[11:56:59] in the esams text case it's 3030+3031 that have it, the other 6 don't
[11:57:12] (there's 2/N from each site+cluster altogether)
[11:57:57] the non-numa networking text@esams hosts seemed to be equally messed up
[11:58:49] (at the general OMG it's broken level, I haven't looked closely)
[12:01:42] right, ok
[12:03:14] ema:
[12:03:19] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1065&var-datasource=eqiad%20prometheus%2Fops&from=now-1h&to=now
[12:03:30] cp1065 only one, can see the +cpu% pattern there
[12:03:39] also interesting is the matching +network recv
[12:03:47] (since xfer is uncompressed from applayer)
[12:04:44] anyways, seems sane enough, it's not going to suddenly melt all the things
[12:06:15] https://gerrit.wikimedia.org/r/#/c/421273/ merging fix for prometheus_query.. the old (bash) check requires the query to be wrapped into scalar()
[12:37:14] ema: this too https://gerrit.wikimedia.org/r/#/c/421275/ , but waiting till after puppetmaster stuff settles down
[12:37:56] (I think it will kill a lot of useless slowlog noise so we can see overall patterns of true misbehavior clearly)
[12:47:28] also interesting is the +hitrate increase seems to be real, across both upload+text, from the elimination of the healthy/grace stuff in vcl_hit. roughly from 95.3% global avg to 96.3% global avg on true-hitrate over a 24h period comparing same day-of-week a week before.
[12:48:27] it sort of makes a rough sense, as we've eliminated a source of "return (miss)", but I wouldn't have expected an extra ~15m of grace-able hits to have such an impact (of course it's probably also the elimination of the related bugs, too)
[12:49:03] https://grafana.wikimedia.org/dashboard/db/varnish-caching?orgId=1&from=now-7d&to=now&refresh=15m
[12:49:19] ^ shows the overall pattern bump circa 2018-03-21 04:00
[12:50:21] (it's not as visible in the "disposition" graph, but I've confirmed it's real there too. actual hit/all goes up, miss/all goes down matchingly, and pass only drops slightly)
[12:52:05] that's something like a net 20% drop in applayer requests between all clusters
[12:53:33] wow
[12:55:44] interesting
[12:59:15] err, the 20% number was true-hitrate-centric and thus inaccurate
[12:59:28] when including all the pass-traffic and looking at the true overall rates
[12:59:37] it's more like a 15% drop in applayer-facing requests out of varnish
[13:01:28] ah not even that, if I really get apples:apples same day-of-week like before
[13:01:50] week-earlier: https://grafana.wikimedia.org/dashboard/db/varnish-caching?orgId=1&from=1521605089730&to=1521691507140&var-cluster=All&var-site=All
[13:02:11] hmmm nope, copypasta
[13:02:24] recent day: https://grafana.wikimedia.org/dashboard/db/varnish-caching?orgId=1&from=1521605089730&to=1521691507140&var-cluster=All&var-site=All
[13:02:33] week earlier: https://grafana.wikimedia.org/dashboard/db/varnish-caching?orgId=1&from=1521000289000&to=1521086707000&var-cluster=All&var-site=All
[13:02:55] so terminal-layer average to applayer is 9.7% -> 9.0%
[13:03:10] more like a 7% drop for that comparison
[13:04:22] (avg across all clusters' requests)
[13:04:29] and it's very upload-biased
[13:05:07] upload lost ~30% there, and text only ~3% heh
[13:11:04] sigh.. we need to come up with a way to benchmark graphite VS prometheus as metric backends
[13:11:28] we rely on dashboards a lot
[13:12:05] benchmark the query performance behind grafana you mean?
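To make the arithmetic above explicit: the drop figures are relative changes, not percentage points. A true-hitrate move from 95.3% to 96.3% means true misses went from roughly 4.7% to 3.7% of requests, and (4.7 - 3.7) / 4.7 ≈ 0.21, which is presumably where the initial "something like 20%" estimate comes from. Folding pass traffic back in, the applayer-facing share going from 9.7% to 9.0% is (9.7 - 9.0) / 9.7 ≈ 0.07, hence the corrected ~7% figure, even though the absolute change is only 0.7 percentage points.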
[12:12:15] in general prometheus is going to be slower, it's a whole different data model
[12:12:19] not the performance
[12:12:28] the quality / accuracy
[13:14:42] heh
[13:14:46] so we need a way to validate the metrics sources, aka varnishstats VS .mtail programs
[13:15:05] well I suspect a lot of it's going to be like the case I was blabbering about with ciphersuites
[13:15:09] yup
[13:15:27] but for varnish-caching, a pretty simple dashboard in terms of queries
[13:15:34] prometheus-varnish-caching looks weird
[13:15:42] and the data doesn't match
[13:15:43] hammering on graphite randomly to get the graph shapes you want leads to statistically-invalid pretty things, and prometheus is a little more natural to get more accurate results
[13:17:09] yeah... but varnish-caching is not hammered at all
[13:17:16] it's just some rates and some counters over time
[13:18:21] well
[13:18:55] I have in the past even in simple cases tried to ask questions and/or look myself, and gotten confusing answers as to what .rate really means in various contexts around data sources from statsd or others, etc
[13:19:28] I suspect there's some fundamental problems in our old pipelines for getting meaningful numbers without committing statistical sins
[13:19:54] so we should embrace the new dashboards and not look back?
[13:20:04] I have never been able to convince myself that for these existing graphite-based ones, I can actually numerically trace how it works all the way back to what true-live-samples should look like at any given second
[13:20:36] there's various kinds of sampling/smearing/averaging going on at multiple layers on different time divisions, etc
[13:25:54] so the best option should be comparing the output of the old (graphite) metric source VS the prometheus one
[13:26:15] to ignore the statistics crimes committed across our monitoring infra
[13:26:38] yeah maybe
[13:30:25] https://phabricator.wikimedia.org/T184942 looked way easier than this *sigh* :)
[13:33:23] lunch time, brb
[13:41:30] yeah I've successfully managed to hide the open can of worms behind my back
[13:44:34] and really, even if you think you've established a clear and statistically-valid pathway from "live OS stat" -> [...] -> "grafana graph"
[13:45:08] then you have to consider what the live OS stat you're measuring really meant statistically, which is sometimes complicated and will thus muck up the rest of the pipeline's validity as well
[13:46:01] e.g. http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
[13:48:34] key quote:
[13:48:38] "These three numbers are the 1, 5, and 15 minute load averages. Except they aren't really averages, and they aren't 1, 5, and 15 minutes. As can be seen in the source above, 1, 5, and 15 minutes are constants used in an equation, which calculate exponentially-damped moving sums of a five second average. The resulting 1, 5, and 15 minute load averages reflect load well beyond 1, 5, and 15 minutes.
[13:48:44] "
[13:48:59] #define EXP_1 1884 /* 1/exp(5sec/1min) as fixed-point */
[13:49:02] #define EXP_5 2014 /* 1/exp(5sec/5min) */
[13:49:05] #define EXP_15 2037 /* 1/exp(5sec/15min) */
[13:49:43] so, good luck stuffing that into some averaging/smoothing stuff in grafana and building something that's accurately statistically meaningful in a way that relates to other system graphs shown beside it
[13:49:44] bblack: any ongoing experiments or can I carry on with reboots?
[13:50:01] I don't know if the puppet agents are all re-enabled yet
[13:50:14] (not from me, from the puppet ca work)
[13:50:47] right, I see that cp3044 has puppet enabled (though it ran last time 99 minutes ago)
[13:50:58] bblack: sigh^2 :)
[13:53:32] my purpose in life is to crush other people's souls by shining light on the inscrutable madness that they thought was a well-ordered world they could understand :)
[13:54:12] too bad that doesn't fit into a tshirt
[13:55:40] puppet reenabled everywhere btw
[13:55:47] yay!
[13:55:48] <3
[13:59:30] yey.. after running puppet-agent on icinga the bgp session check is working as expected
[13:59:33] :D
[14:01:36] \o/
[14:05:21] I should at least make a task about the 200-OK-failed thing before I forget that one
[14:18:11] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072792 (10BBlack)
[14:28:34] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072782 (10Anomie) >>! In T190410, @BBlack wrote: > 1. Relatively-minor issue: It times out internally: there's a ~60s pause before any output is...
[14:50:55] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072879 (10BBlack) >>! In T190410#4072815, @Anomie wrote: > Define "times out internally". I mean this from the naive point of view of: when I...
[15:20:43] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072934 (10Anomie) >>! In T190410#4072879, @BBlack wrote: >>>! In T190410#4072815, @Anomie wrote: >> Define "times out internally". > > I mean...
[15:34:58] 10netops, 10Operations: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4073014 (10faidon) I have only hunches and no data to back any of this, but I think ElasticSearch, Hadoop, WMCS, Backups, plus probably Ganeti and Kafka would be good candidates to go 10G-only. Kubernetes I could see it g...
[15:37:58] 10netops, 10Operations: eqiad 10G ports needs - https://phabricator.wikimedia.org/T190364#4071230 (10Ottomata) I'd love to see stream processing from Kafka running in Kubernetes one day (pipe dream!), and that could be highish traffic.
[16:19:07] staring at graphs again
[16:19:40] this is the 20th, number of backend objects:
[16:19:41] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521477686128&to=1521553407897&panelId=8&fullscreen
[16:20:48] and the 21st:
[16:20:50] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=1521542663000&to=1521676729327&panelId=8&fullscreen
[16:21:51] now as bblack was saying earlier, we've kept on focusing on the varnish-be side because it's the one doing all the crazy things, but perhaps they are induced by some abnormal varnish-fe behavior
[16:24:05] s/panelId=8/panelId=3/ in those URLs above to get to failed fetches
[16:28:22] 10netops, 10Operations, 10cloud-services-team: modify labs-hosts1-vlans for http load of installer kernel - https://phabricator.wikimedia.org/T190424#4073206 (10RobH) p:05Triage>03Normal
[16:44:52] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073263 (10BBlack) >>! In T190410#4072934, @Anomie wrote: >>>! In T190410#4072879, @BBlack wrote: >> However, I find it a bit specious to use the...
[16:53:07] <_joe_> uhhh I must resist the urge to answer on that ticket?
[16:54:25] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4073341 (10BBlack) Yeah I do have concerns here. It's going to take some time before I can loop back and explain them, but I just wanted to put the not...
[17:17:24] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4073423 (10Cmjohnson) Row A connections are complete cr1 xe-3/0/0 -> xe-2/0/44 #4776 cr1 xe-3/1/0 -> xe-2/0/45 #3452 cr1 xe-4/0/0 -> xe-7/0/44 #1985 cr1 xe-4/1/0 -> xe-7/0/45...
[17:43:53] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4073507 (10CCogdill_WMF)
[17:44:15] 10Traffic, 10Fundraising-Backlog, 10Operations, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561#4012170 (10CCogdill_WMF) Updating task as I want to update the subdomain in the request.
[17:50:49] bblack: long shot, but do we have data on latencies that are acceptable for users or not? or a median/average latency we should aim for in general?
[17:51:15] I remember a paper or blogpost written by google or youtube on the topic, but can't find it anymore
[18:10:18] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073589 (10Anomie) On the other hand, with all those different proxies that makes it much more likely one of them is going to throw its own error...
[18:21:10] bblack: thanks to the services team the content-location issue of this morning should be fixed now!
[18:27:42] XioNoX: eh there's a lot of wishy-washy data on that, but it's highly context dependent and somewhat subjective
[18:28:07] XioNoX: either way, we're aiming for the best we can reasonably get given constraints! :)
[18:28:35] bblack: yeah, it's for generating a pretty map, what latency should be green, what should be red, etc. :)
[18:32:21] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4073679 (10Cmjohnson) @ayounsi xe-2/0/44 -> cr1-eqiad:xe-3/0/2 #1984 xe-2/0/45 -> cr1-eqiad:xe-3/1/2 #3452 xe-7/0/44 -> cr1-eqiad:xe-4/0/2 2627 xe-7/0/45 -> cr1-eqiad:xe-4/1/2 346...
[18:45:04] 10netops, 10Operations: Security audit for tftp on Carbon - https://phabricator.wikimedia.org/T122210#4073756 (10Dzahn)
[18:48:29] 10netops, 10Operations: Security audit for tftp on install1001 - https://phabricator.wikimedia.org/T122210#4073763 (10ayounsi)
[18:50:21] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073785 (10BBlack) >>! In T190410#4073589, @Anomie wrote: > On the other hand, with all those different proxies that makes it much more likely on...
[18:51:36] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4073787 (10Cmjohnson) change cable number for cr1 xe- 4/0/1 to following xe-7/0/44 -> cr1-eqiad:xe-4/0/2 #3509
[19:04:49] 10Traffic, 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4073815 (10BBlack) We've talked about this a bit this week. Basic initial steps of the plan at this point are: 1) T...
[19:07:51] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073831 (10Anomie) >>! In T190410#4073785, @BBlack wrote: > Yes, it's entirely possible any of the proxies might cause or record errors, and we'd...
[19:08:03] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073833 (10Anomie)
[19:08:26] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4072782 (10Anomie) It's now clear to me that this task is a duplicate of T40716, so I'm going to close it as such.
[19:16:38] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960#4073864 (10ayounsi)
[19:17:16] 10netops, 10Operations, 10ops-eqiad: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4073866 (10ayounsi)
[19:49:51] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4073920 (10BBlack) >>! In T190410#4073831, @Anomie wrote: > Yes, there's a difference between getting "500 Something broke" and getting a data st...
[20:58:10] 10Traffic, 10Operations: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4074178 (10Krinkle)
[21:01:42] 10Traffic, 10Operations: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4074189 (10Ragesoss) @Krinkle thanks much! From that description, I'm guessing this won't affect many people. I was querying for arbitrary .json pages and n...
[21:07:49] 10Traffic, 10Operations: How is Varnish errorpage enabled for empty 404 text/html from mw/index.php?action=raw - https://phabricator.wikimedia.org/T190450#4074212 (10Krinkle) a:05Krinkle>03None
[21:51:17] 10Traffic, 10MediaWiki-API, 10Operations: Query API for rev props times out with an error message, but status is 200 OK - https://phabricator.wikimedia.org/T190410#4074369 (10Anomie) >>! In T190410#4073920, @BBlack wrote: > What exactly is the client going to do differently with `