[07:01:07] <_joe_> cdanis: that would be having some meaningful check in pybal, yes
[07:01:34] <_joe_> notwithstanding my repeated pleas to that end, people seem to always overlook that when setting up a new service
[07:01:58] <_joe_> good thing kubernetes will make people pay unless they think better about it :)
[12:39:53] indeed
[12:40:18] godog: something odd happened in swift @ codfw around 19:30
[12:40:46] there was a lot of uploads (seen in eqiad too) but in codfw object dispersion dropped slightly afterwards and hasn't recovered
[12:40:57] i haven't touched the rings since last week
[12:42:53] cdanis: odd indeed, I'm taking a look too
[12:43:18] oh the uploads were around 13:00, much earlier
[12:44:43] weirdly the general pattern seen around 19:30 -- dispersion drops, read latency increases, big increase in backend network traffic -- *looks* like a change to the rings or re-replication of data
[12:44:54] but I don't think we had a machine or disk failure there?
[12:46:04] afaik no we didn't
[12:48:11] only problem i see in icinga is ms-be2014 and those disks have been unhappy since ... 2019-04-17 09:38:15 by my logs
[12:48:13] so idk
[12:56:01] ah mhh there was actually a disk failure on ms-be2043 at 18:07, not sure yet if it is that
[12:56:19] what I did is run 'swift-dispersion-report' on ms-fe2005
[13:00:07] ahh
[13:01:06] 18:07 is plausible
[13:03:31] yeah the disk failure is for sure a problem, although the spike in network happened an hour later afaics
[13:08:41] swift is not so smart as to automatically re-replicate data, right? it will read from other replicas of that data, it will write new data to a 'handoff' location, but it won't proactively re-replicate existing data to a handoff location?
[13:12:06] that's my understanding yes, "hinted handoff" in the sense that handoff locations are used only to buffer data while the expected location is unavailable, but don't get data actively replicated to them
[13:17:12] not sure why nothing caught the drive failure
[13:17:19] ok I'm spot checking the partitions with reported missing data, so far ms-be2043/sdd appears in all of them so that might be it after all
[13:17:28] makes sense
[13:17:59] heh, according to megacli the disk is fine -.-
[13:18:43] however swift looks at kernel logs and unmounts + comments out in fstab problematic disks/filesystems
[13:19:04] it looks like the disk threw a read error and --- yeah
[13:19:37] May 1 18:36:12 ms-be2043 smartd[1866]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], ATA error count increased from 0 to 2
[13:19:52] smartd might have sent mail as well
[13:20:11] (although i have no hits in my mail for ms-be2043)
[13:20:22] I've added my 2 cents to the task
[13:28:36] thanks !
[13:29:59] yeah I don't know if "media error count" should mean a task to investigate/replace the disk
[13:30:57] that is fair
[13:31:08] should we just remount it and see if it happens again?
[13:31:57] +1, cdanis are you doing the honors?
[13:32:11] will do
[13:35:21] sweet!
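(Editor's note: a minimal Python sketch of the primary-vs-handoff distinction discussed above at 13:08-13:12, assuming the standard OpenStack Swift `swift.common.ring.Ring` API and rings installed under /etc/swift; the account, container, and object names are made-up placeholders. It only shows where Swift *expects* replicas to live versus which handoff nodes would buffer new writes while a primary is down, matching the point that existing data is not proactively copied onto handoffs.)

```python
# Sketch only: inspect primary and handoff locations for one object.
# Assumes the OpenStack Swift library is installed and object.ring.gz
# lives in /etc/swift; names below are hypothetical placeholders.
from itertools import islice

from swift.common.ring import Ring

ring = Ring('/etc/swift', ring_name='object')

# get_nodes() returns the partition and the primary nodes for this object.
part, primaries = ring.get_nodes('AUTH_example', 'example-container', 'example/object')
print('partition:', part)
for node in primaries:
    print('primary :', node['ip'], node['port'], node['device'])

# get_more_nodes() yields handoff candidates in order; these only receive
# writes while a primary is unavailable -- nothing in this listing (or in
# the behaviour described above) pushes existing data onto them up front.
for node in islice(ring.get_more_nodes(part), 3):
    print('handoff :', node['ip'], node['port'], node['device'])
```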
yeah dispersion will recover in 15m
[13:35:36] the graphite metric that is, I ran dispersion-report manually and it is at 100%
[13:36:36] {{done}}
[13:39:42] thank you, I'm looking through the logs for the network spike because afaik that's unexplained by the disk failure
[13:43:46] looking at grafana, the frontends did not have an increase in traffic, just most of the backend machines
[13:44:30] just to make sure we are all on the same page, FYI there was a rolling restart of swift-proxy this morning because of issues at the CDN layer
[13:52:19] indeed, that fix resulted in drastic 503s reduction \o/
[13:52:30] also found https://phabricator.wikimedia.org/T222365 while looking at the logs
[13:56:02] cdanis: a tangential thought I had the other day, with the new swift hosts coming in it'd be interesting to try out the "servers_per_port" object-server config, IOW 1+ individual object servers per disk
[13:56:24] oh interesting
[13:56:46] what is object-server's execution model anyway? thread per request? something else?
[14:00:54] multiprocess all listening on the same port iirc
[14:05:09] interesting, those benchmarks look nice
[14:17:03] indeed, I'm opening a placeholder task
[14:40:01] sounds good! and looks pretty easy -- they made it pretty equivalent from an operational perspective
[15:48:41] real 3275m36.699s -- the time it took for bast3002's first prometheus rsync to complete )o)
[15:50:49] a whopping 200kb/s
[16:54:31] haha wooow
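(Editor's note: a rough pre-fork sketch of the "multiple processes accepting on one shared listening socket" model mentioned at 14:00:54; this is not Swift's actual object-server code, and the port number and worker count are arbitrary placeholders. The contrast with the servers_per_port idea is that, under that option, each disk is assigned its own port with its own dedicated worker processes, so one slow disk is less able to stall requests for the others.)

```python
# Sketch only: a parent binds one socket and forks workers that all accept()
# on it, illustrating the shared-port multiprocess model described above.
# Runs until killed; values below are placeholders, not Swift defaults.
import os
import socket

LISTEN_ADDR = ('127.0.0.1', 6200)   # hypothetical object-server-style port
WORKERS = 4

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(LISTEN_ADDR)
sock.listen(128)

for _ in range(WORKERS):
    if os.fork() == 0:
        # Child: inherits the listening socket and competes to accept().
        while True:
            conn, _peer = sock.accept()
            conn.sendall(b'HTTP/1.1 204 No Content\r\n\r\n')
            conn.close()

# Parent: just wait on the workers.
for _ in range(WORKERS):
    os.wait()
```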