[11:01:42] <_joe_> so, regarding the page
[11:02:08] <_joe_> interestingly I don't see any check in the apache logs after 10:34:10
[11:02:09] jynus beat me to it.. 10:36:11 :)
[11:02:42] <_joe_> and not until 10:45 I see another one
[11:02:55] <_joe_> sorry, 10:43
[11:03:42] <_joe_> vgutierrez: pybal logs say something about depools/repools happening around that time?
[11:04:30] not for that service
[11:04:30] there are three SOFT 1's before that one, in case that is helpful: 07:31:27 09:07:09 10:04:41
[11:05:22] <_joe_> apergos: so this is even worse
[11:05:24] <_joe_> I see
[11:05:30] <_joe_> 2020-07-22T10:04:05 550661 208.80.153.74 proxy:unix:/run/php/fpm-www.sock|fcgi://localhost/200 23909 GET http://en.wikipedia.org/w/api.php?action=query&meta=siteinfo - text/html - -check_http/v2.2 (monitoring-plugins 2.2) - - - - 208.80.153.74 - -
[11:06:12] <_joe_> this says that at 10:04:05 icinga2001 was able to query and got a response in 550 ms
[11:06:41] I was looking at icinga1001 to be clear
[11:06:46] <_joe_> yes
[11:06:48] but that is the passive one
[11:06:51] <_joe_> and I find nothing from icinga1001
[11:07:00] I'd say that's a different request
[11:07:02] are you thinking cross-site latency issues?
[11:07:23] <_joe_> I don't know
[11:07:32] <_joe_> vgutierrez: what do you mean?
[11:07:47] 10:04:05 VS 10:04:41?
[11:08:02] or something is running awfully slow on the icinga side..
[11:08:11] <_joe_> vgutierrez: well it's more complicated than that, but sure, it was icinga2001, I said so
[11:08:41] <_joe_> so interestingly if I search
[11:08:55] <_joe_> "fgrep T10:04 /var/log/apache2/other_vhosts_access.log | fgrep 208.80.154.84 | fgrep '/w/api.php?action=query&meta=siteinfo' | grep -v lastindex || true"
[11:08:56] T10: Where does this go - https://phabricator.wikimedia.org/T10
[11:09:07] <_joe_> there is *nothing* in the apache logs
[11:09:21] why thanks stashbot :-/
[11:09:26] <_joe_> now lemme look at the tls terminator ones
[11:09:58] I short circuited that and just looked directly at the syslog on the icinga host
[11:10:33] hmmm
[11:10:37] [2020-07-22 10:36:11] SERVICE ALERT: api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds
[11:10:56] see svc.codfw and svc.eqiad afterwards
[11:11:18] could it be that icinga1001 is checking eqiad even for the codfw api svc?
[11:11:22] and 2001 codfw?
[11:11:34] some ::site evaluation
[11:11:37] can we check icinga2001 to see if it got any issue, even a soft one, around the same time?
[11:12:04] <_joe_> vgutierrez: that would be perplexing, given eqiad didn't alert
[11:12:10] <_joe_> also no, I do see the queries
[11:12:14] <_joe_> just not at that time
[11:12:25] we have a nice discrepancy there though
[11:12:28] <_joe_> and there is no trace in the envoy logs either
[11:12:38] Jul 22 09:08:48 icinga2001 icinga: SERVICE ALERT: api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.codfw.wmnet IPv4 #page;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds so it failed earlier
[11:12:45] and there are no fails logged after that one
[11:13:11] <_joe_> apergos: uh, 09:08???
[11:13:16] yep
[11:13:17] <_joe_> I'm not sure I understand
[11:13:39] answering jynus' question: that's the last entry of soft or hard failure, nothing else is logged after that
[11:13:44] <_joe_> we got a page at 10:40 for something that happened at 09:08
[11:13:49] <_joe_> ?
[11:13:51] no no
[11:13:58] that's icinga2001
[11:14:00] <_joe_> oh that's icinga2001
[11:14:00] we are comparing the other host
[11:14:12] as a suggestion, to discard a monitoring issue and to get more info
[11:14:24] both icinga hosts do checks, only 1 alerts
[11:14:35] at least for normal checks
[11:15:03] <_joe_> ok, I have to go afk, we can continue later
[11:15:11] <_joe_> but ofc the logs have no trace of that request
[11:15:29] <_joe_> which could be explained with icinga closing the connection abruptly maybe?
[11:15:43] <_joe_> anyways, more on this later, bbiab
[11:16:46] my conclusions so far: the alert is flapping sometimes, but it depends on the request
[11:16:48] so it looks like only the service description is wrong on icinga1001
[11:17:16] but the actual requests are being made against the proper DC
[11:17:29] it got a HARD at ~9 for icinga2001, and around 10:36 for icinga1001
[11:26:46] hello! I have a docker image I'd like to create in production-images if anyone has a few minutes for review https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/615168
[11:26:58] nothing too fancy but a bit longer winded than most of the docker-pkg templates
[11:33:18] hnowlan: I can take a look
[11:37:14] jayme: thanks!
[11:49:40] <_joe_> so, nothing in access logs at the time of the issues, nothing in the error log
[11:49:47] <_joe_> I can't really understand what's going on
[11:50:13] <_joe_> the only thing I know for sure is that the codfw cluster doesn't seem to actually act up
[11:50:58] <_joe_> the alternative is that for some reason that request sometimes takes longer than 10 seconds and disconnecting the client before it finishes causes that not to be logged
[11:51:14] <_joe_> but then I'd expect to see something in the envoy logs about that
[11:51:55] <_joe_> except this was on port 80, not 443
[12:10:42] where is `passwords::misc::scripts` defined in puppet? *scratches head*
[12:11:39] kormat: /srv/private/modules/passwords/manifests/init.pp on puppetmaster1001
[12:12:52] jbond42: huh. that path doesn't map to what i expected puppet to do
[12:13:40] in my head that's basically puppet's tagline *g*
[12:13:50] 👏
[12:15:08] kormat: /srv/private/modules/ is on puppet's module path; as such you can include the password modules with include passwords::foo
[12:15:22] all password modules are defined in the password init.pp file
[12:16:34] i thought a tenet of puppet was that the filesystem layout corresponded to the puppet path
[12:16:51] e.g. i would have expected it to be in `/srv/private/modules/passwords/manifests/misc/scripts.pp`
[12:17:50] kormat: that is a good assumption and is definitely the convention
[12:18:21] however puppet (i believe) will slurp everything under modules/$mymodule/manifests
[12:18:46] ohno
[12:51:49] hiya moritzm yt?
[12:52:14] debian pkg q for ya
[13:03:45] or maybe for akosiaris ?
[13:03:55] uh oh
[13:04:07] ottomata: just ask, there's like half a dozen Debian people here :)
[13:04:32] haha true!
[13:04:33] well ok
[13:04:34] ottomata: in the middle of a migration
[13:04:38] ok ignore!
[13:05:02] ok so it seems that some .so and .a files are being stripped from the final deb packaging
[13:05:11] i'm trying to build anaconda .deb
[13:05:37] not from source, i'm just repackaging basically a tarball release, which includes a bunch of prebuilt binaries
[13:06:03] and i'm not sure what is removing them
[13:07:22] <_joe_> well .a files are probably statically linked in the final binary?
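One way to confirm which debhelper step is doing the stripping (a sketch only, assuming a standard dh-based debian/rules; run from the unpacked package source tree):

    # Print the dh_* helpers that "dh binary" would invoke, without running them.
    # dh_strip appearing in this list is the usual culprit for stripped .so/.a files.
    dh binary --no-act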
[13:07:42] <_joe_> ottomata: if you have a tarball release, you can just build a binary debian package
[13:07:59] <_joe_> that's both gross and makes sense in this case
[13:09:17] https://gist.github.com/ottomata/2e9933dfac6c9f05ea314eb9eb86e454
[13:09:32] _joe_: do I have to do something special to do that?
[13:09:46] googling... :)
[14:16:32] dh invokes dh_strip by default, I think adding an override should fix this, add to debian/rules:
[14:16:34] override_dh_foo:
[14:16:37] override_dh_strip:
[14:16:45] dh_strip -k
[14:17:08] or simply # Skipping dh_strip, as we use prebuilt binaries we don't want to strip in addition
[14:21:03] elukey: oops, sorry for not getting back yesterday -- that nvme link is fascinating
[14:50:08] rzl: np! no urgency, I was just pinging you to know your opinion
[14:50:47] it sounds like it'd be a Project to even experiment with, and I have no idea what the results would be like for our workloads, I'd be interested in what the perf folks think
[14:51:22] but that writeup makes it sound pretty tantalizing
[14:51:24] IIRC we used NVMe in some use cases (cp nodes maybe, but I am not sure)
[14:53:30] I brought it up now since it would be great to either test a shard with NVMe before doing the refresh of mc*, or ask our vendors if NVMe can be added later on
[15:24:33] would it need 10 Gb cards as well?
[15:28:17] hard to know, I think -- NVMe wouldn't change anything about our bottlenecking and gutter-pool failover under heavy load, so in that sense it's an orthogonal question
[15:28:40] but if NVMe works really, really well, it's possible that our hit ratio would improve to where we needed more network throughput under normal conditions
[15:28:45] so yeah, we've done nvme for our latest few cache node batches
[15:28:58] (that's hard to predict but feels unlikely)
[15:30:07] we used to use SATA SSDs on our older caches, then for one purchase we did nvme with U.2, then the last couple of purchases we've used the card version (nvme on a separate HHHL card)
[15:30:28] how has it treated you?
[15:30:39] https://wikitech.wikimedia.org/wiki/Traffic_cache_hardware has the hardware stuff
[15:31:32] if random access latency of disk storage and/or raw xfer bandwidth is a bottleneck, I think it can be an improvement over plain SSD
[15:32:11] they have some nice features - it gets rid of some legacy interface bottlenecks, and they do native 4K blocks (so storage block size == memory page size)
[15:32:34] but we've never really tried to quantify what the performance improvements bought us vs cost diff
[15:33:07] (at the time we started going that way, we were in an "anything to help storage is probably a win" sort of mode, back when varnish-be was using the disks and flailing a lot)
[15:33:27] that makes sense
[15:33:49] in our case it isn't competing with disk obviously, but if the latency is better enough, we might expand the cache from RAM onto it
[15:33:58] ah
[15:34:01] even SSD is too high-latency for that to fly
[15:34:23] the kind of mass nvme we're using on the caches, it's not so much better than SSD to start looking like a RAM substitute
[15:34:25] but the idea is the keys stay in memory and the less-used larger values move to extstore via nvme
[15:34:31] it's better-than-SSD, categorically, but not RAM-like
[15:34:59] might be different with intel's 3dxpoint stuff
[15:35:11] (which costs more and is a little more complicated to configure/use!)
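For reference, a rough sketch of what the extstore arrangement mentioned above looks like on the memcached side (ext_path is the upstream memcached 1.5+ extstore option; the memory size, file path and file size below are made-up values for illustration only):

    # Keep keys and hot values in RAM (-m, in MB) and spill colder/larger values
    # to a 64G file sitting on the NVMe device:
    memcached -m 8192 -o ext_path=/srv/extstore/extstore.bin:64G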
[15:35:58] https://wikitech.wikimedia.org/wiki/Traffic_cache_hardware
[15:36:00] oops I meant:
[15:36:05] https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html
[15:36:55] 10 microseconds for a read, apparently
[15:38:39] yeah, they tested that in elukey's article (apparently in exchange for free hardware to test with)
[15:38:47] https://memcached.org/blog/nvm-caching/ was the link
[15:40:00] do we have an idea of a uh
[15:40:02] okay I'm gonna say it
[15:40:06] memcache performance SLO?
[15:40:27] we don't yet, as far as I know
[15:40:29] well, that's not strictly true
[15:40:37] I have a yellow sticky on my desk that says "memcache SLO?" on it
[15:40:50] so you could say we have the beginnings of one
[15:41:36] (highlighting effie on this conversation too since she's been thinking about mc performance evaluations lately)
[15:44:16] roughly, I think we've been maintaining a "status quo performance SLO" -- every architectural change we've talked about, the expectation is that it'll either make things better than they were before, or at least not worse, and if we couldn't meet that we'd probably revert
[15:44:42] which isn't great for obvious reasons but it's where we're at
[15:45:10] even evaluating the "better than before" is very difficult at the moment
[15:45:17] at least for me
[15:46:01] yeah -- and especially one of the things we've talked about in serviceops is we care about the performance in two distinct regimes, normal and hot-key
[15:47:17] I know that I am repetitive but memcached 1.5.x will bring us a ton of perf benefits
[15:47:36] it will take a bit to tune it, but in the end it will outperform 1.4 in my opinion
[15:49:07] * elukey looks at https://phabricator.wikimedia.org/T252391 with hope
[15:50:11] <_joe_> first question is: what metrics do you want to use in the SLO?
[15:50:24] <_joe_> the worst latency in fetching a key, as measured from mcrouter?
[15:51:32] I think you need to break it down by key family, or pick a 'standard candle' read size
[15:52:28] by size or by frequency?
[15:53:42] <_joe_> cdanis: ok, I don't think we have data broken down by key size in mcrouter
[15:53:46] by frequency you mean rps?
[15:54:08] <_joe_> hence why we track the worst average performance over all the cluster
[15:54:30] <_joe_> mcrouter only gives you an average latency, but it's highly correlated with memcached performance
[15:55:03] volans: assuming you mean rps, well, they're all connected ofc
[15:55:09] you get a sort of SLO flight envelope
[15:57:24] cdanis: I meant that maybe we want to say that the XX% most requested keys have some latency threshold while the remaining ones have a more relaxed one
[15:57:31] in particular if we use 2 different physical media to store them
[15:57:35] oh, sure
[15:58:04] that's reasonable, and defining latency SLOs with a few percentiles is natural anyway
[15:59:15] yeah, and to _joe_'s point, that would mean improving our instrumentation -- we don't actually have latency percentiles here afaik
[15:59:39] but that would be a good improvement even if we weren't writing an SLO
[15:59:39] <_joe_> yes, we don't
[16:00:02] <_joe_> also I would not trust php with reporting sub-ms latencies appropriately
[16:00:17] <_joe_> so we'd have to add the percentile buckets to mcrouter
[16:00:56] isn't that a good idea anyway?
[16:01:07] <_joe_> cdanis: prioritization is key here
[16:01:10] ok
[16:01:11] <_joe_> sure it is
[16:01:46] <_joe_> but I'm not sure the effort/benefit is worth it
[16:02:31] <_joe_> if someone wants to spend a couple days finding out how hard that would be, and if it's relatively easy I could try to submit such a patch upstream
[16:02:45] <_joe_> maintaining a patch to mcrouter is something I would not do.
[17:01:43] _joe_ why not? it would be a joyful experience after every release :D
[17:09:36] jynus: there's some rumblings in the networking world that may result in downtime or migration for the transferpy-test VMs. Can you give me a rough idea of how disruptive that would be?
[17:09:47] (I will open a ticket about this as soon as I actually know anything)
[17:10:32] that's ok, they are being used, but they are for testing
[17:10:45] good news! thanks
[17:10:46] if they get an interruption for some hours, that is ok
[17:11:04] i will definitely warn if that's going to happen
[17:11:13] if we are one of the few affected people, could you send an email to all admins, if that is easy?
[17:11:22] yep
[17:11:27] as in, if it is practical to ask and it is not a million emails
[17:12:29] also, as promised, we will close down the project in august (1 month life)
[17:34:26] elukey: rzl there is a test that I have been wanting to do
[17:34:52] which is to firewall all mc* hosts from 1 appserver and 1 api server
[17:35:08] and let those two use solely the gutterpool
[17:36:16] since the gutterpool is 1.5
[17:36:44] oh interesting
[17:37:28] we mentioned it in some chats we had with mar.k, it is a quick way to test memcached 1.5 with our workload
[17:38:25] I have not fully thought this through, but I think it would help us do some config optimisations, if needed, before we rebuild the whole cluster
[17:39:23] the only nitpick would be to remove the static TTL we add to objects that end up in the gutter pool
[17:40:12] and since those servers have 10G nics, we could even try to add more app/api servers to the experiment
[17:40:26] yeah, and that's configured in mcrouter, right? so not hard to do per-appserver
[17:40:57] mmm, unless something else was using the gutter pool at the same time because there was an actual TKO
[17:41:05] then they'd read each other's TTLs and that gets messier
[17:41:54] well, we can consider ourselves *very* unlucky
[17:42:10] and revert the experiment, which shouldn't be very hard anyway
[18:01:49] https://phabricator.wikimedia.org/P12022
[18:01:51] :)
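A rough sketch of how the firewalling experiment described above could look on the single appserver under test (the shard hostnames are placeholders, 11211 is the default memcached port, and in practice the change would presumably be managed via ferm/puppet rather than ad-hoc iptables):

    # Reject outbound memcached traffic to the regular mc* shards from this host
    # only, so mcrouter marks them TKO and fails over to the gutter pool (1.5):
    for mc_host in mc1019.eqiad.wmnet mc1020.eqiad.wmnet; do  # placeholder shard list
        iptables -A OUTPUT -p tcp -d "$mc_host" --dport 11211 -j REJECT
    done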