[06:47:06] good morning
[06:47:17] around 4:30 UTC there was a memcached TKO event
[06:47:27] (for appservers) https://grafana.wikimedia.org/d/000000549/mcrouter?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=All&var-memcached_server=All
[06:48:41] that was for multiple shards
[06:49:26] mc1022, mc1036, mc1029
[06:50:26] the part that I don't like much is that in the past few days we had TX bw saturation events for multiple shards
[06:50:29] https://grafana.wikimedia.org/d/000000316/memcache?panelId=56&fullscreen&orgId=1&from=now-2d&to=now
[07:14:45] <_joe_> elukey: I think that 4:30 event lines up with some other data
[08:45:42] _joe_: RE: api.svc.codfw|eqiad confusion.. can we get rid of ::site in the description? https://github.com/wikimedia/puppet/blob/production/hieradata/common/service.yaml#L83
[08:46:06] cause that's not working as expected
[08:46:32] <_joe_> vgutierrez: sure, but it's not working as expected when used in a place we didn't use it at first
[08:46:44] <_joe_> let's fix all of those :)
[08:48:52] just make sure we don't create duplicate descriptions in icinga ;)
[08:49:50] doh, we again have some: 'ganeti-mond running' is duplicated multiple times (cc akosiaris)
[08:49:51] right now we have duplicated descriptions already
[08:50:03] vgutierrez: not all of the duplicates are checked
[08:50:08] only one is
[08:50:12] that's for the hosts
[08:50:17] not for the descriptions, AFAIK
[08:50:29] volans: interesting!
[08:50:43] I hadn't noticed
[08:51:04] in any case.. ::site is always being rendered on icinga1001 as eqiad
[08:51:10] vgutierrez: that's for 'service_description' in the icinga config
[08:51:21] sure, and that's wrong
[08:51:21] that's why the api.svc.codfw description says api.svc.eqiad
[11:23:50] possibly dumb question - restbase-dev1004 is set up with software raid1 and a disk just got replaced. What's the correct way to proceed with re-adding the disk, as far as our automation etc. is concerned? Can I just partition it and add it to the array by hand?
[11:24:09] rather than rebuild the host entirely :)
[11:30:26] hnowlan: yes, there's no problem with that; you can fix it manually, as long as the array rebuilds correctly
[11:30:57] cool, thanks!
[11:55:55] btw _joe_ vgutierrez I filed T258697
[11:55:56] T258697: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697
[11:56:27] ohh thanks
[11:56:51] I think we should just rip it out entirely, unless I'm missing something about how the description field is used
[12:06:51] yeah, I agree with you
[12:27:52] akosiaris: is your puppet merge still running after 5 minutes?
[12:28:00] or did you forget to answer yes? :)
[12:28:31] marostegui: guilty as changed
[12:28:37] hahaha
[13:09:52] _joe_: so, I want to add some consistency checking somewhere between conftool records and puppetized realserver LVS IPs
[13:10:15] <_joe_> cdanis: yeah I was thinking about it last night
[13:10:26] I've thought of a few possibilities but they're all annoying
[13:11:57] <_joe_> cdanis: I reached the conclusion that having two sources of truth is stupid
[13:12:01] it is
[13:12:33] did you have any good ideas on fixing the larger problem?
[13:12:36] _joe_: ITYM 2 sources of lies
[13:15:53] kormat: oh, come now, what pieces of software could be more trustworthy than Puppet, or some Python joe wrote that talks to etcd?
[13:16:02] :D
[13:19:21] hnowlan: I thought we had it documented on wikitech, but I'm failing to find it (cdanis maybe?) The other thing to do is to install the bootloader, if those were the OS disks, so that it's effectively on both disks, IIRC
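A minimal sketch of the manual procedure being described above, covering both the re-add and the bootloader note. The device names (/dev/sda surviving, /dev/sdb replaced), the array name /dev/md0, and the MBR partition-table assumption are all illustrative, not details from the log:

    # Copy the partition table from the surviving disk to the replacement
    # (for GPT disks, sgdisk would be used instead of sfdisk).
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # Re-add the new partition to the degraded RAID1 and watch the resync.
    mdadm --manage /dev/md0 --add /dev/sdb1
    cat /proc/mdstat

    # If these are the OS disks, put the bootloader on the new member too,
    # so the host still boots if the other disk dies later (see T215183).
    grub-install /dev/sdb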
[13:19:53] <_joe_> cdanis: I was mostly thinking of having some puppet code that reads conftool-data and, if it sees a discrepancy with what's in profile::realserver::lvs_pools, just fails to run
[13:20:13] <_joe_> the alternative is to generate conftool-data from puppetdb
[13:20:54] volans: hnowlan: grub-install /dev/sdX should just work for that; it's not documented because the original plan of record was to have dcops do that on any disk replacement, I think? T215183
[13:20:54] T215183: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183
[13:21:43] if having an assertion failure when they don't match is easy, we should do that first
[13:21:50] generating it from puppetdb is appealing, though
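A minimal sketch of the kind of cross-check being discussed, done out-of-band in shell rather than as failing puppet code. The /pdb/query/v4 endpoint is PuppetDB's real query API, but the puppetdb host name, the exact class title, and the conftool-data extraction are assumptions for illustration:

    # PuppetDB side: certnames of hosts that declare the realserver profile.
    curl -sG 'http://puppetdb1001.eqiad.wmnet:8080/pdb/query/v4/resources' \
      --data-urlencode 'query=["and",["=","type","Class"],["=","title","Profile::Realserver"]]' |
      jq -r '.[].certname' | sort -u >/tmp/realserver-hosts

    # Conftool side: node names appearing in conftool-data in a puppet repo
    # checkout (a crude extraction; real tooling would parse the YAML).
    grep -rhoE '[a-z0-9-]+\.(eqiad|codfw)\.wmnet' conftool-data/node/ |
      sort -u >/tmp/conftool-hosts

    # Any name in one list but not the other is a discrepancy to fail on.
    comm -3 /tmp/realserver-hosts /tmp/conftool-hosts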
[13:21:51] ah interesting
[13:22:08] since I asked I've realised the machine just has to be totally reimaged anyway
[13:22:11] heh
[13:22:32] in that case, it will get fixed up by the reimage :D
[13:22:54] also hnowlan, not sure how it's partitioned, but you might be able to use kormat's recent work to persist /srv across a reimage
[13:24:35] Unfortunately /srv/ is part of the problem; on restbase-dev hosts it's raid0, and the machine has been RO for so long that the data is quite stale
[13:25:15] ack
[13:28:16] godog: \o/ i got a puppetdb + cumin host running in a pontoon env
[13:28:32] it's beautifully clean (please don't look at the code)
[13:28:47] lol
[13:31:04] kormat: \o/ \o/
[13:31:22] * godog peeks
[13:32:21] actually no, I'll keep the clean image in my head
[13:32:23] is puppetboard broken for anyone else?
[13:32:34] seeing repeated HTTP 502s for the API requests it's making
[13:32:58] on https://puppetboard.wikimedia.org/reports specifically
[13:34:12] ahah https://logstash.wikimedia.org/goto/bd2219d0247533c227eadf2bf3a0214b
[13:34:29] godog, kormat, re: pontoon and puppet...
[13:34:31] https://www.irccloud.com/pastebin/HEZEBT2t/
[13:35:02] andrewbogott: phew, none of those are mine ;)
[13:35:43] andrewbogott: hah! that'd be me alright
[13:36:24] cdanis: what are you doing to puppetboard?
[13:36:33] trying to use it at all
[13:36:39] very rude of me, I realize
[13:37:04] cdanis: i wonder if it's somehow related to me running a puppet master in a cloud vps..
[13:37:14] that would be very surprising
[13:38:15] puppetboard loads for me now (and i didn't change anything on my end :)
[13:38:26] the /reports page?
[13:38:42] I think I've seen an issue go by on their GH
[13:38:48] and we haven't updated it in a while
[13:38:55] not sure if related
[13:39:20] cdanis: mm, no. the front page loaded ok.. /reports is taking a loooong time
[13:39:34] kormat: if you look in your network inspector i'm pretty sure you'll see lots of repeated 502s
[13:39:49] yep
[13:39:54] what do you need reports for? :D
[13:39:59] * jbond42 also seeing the 502 on reports/json?draw=...
[13:40:10] cdanis: yep
[13:42:14] volans: I saw something that suggested puppet might have already errored on mw2335 and friends yesterday, before the fix
[13:42:17] and wanted to check
[13:42:35] also /catalogs is broken
[13:42:46] just go to https://puppetboard.wikimedia.org/node/mw2335.codfw.wmnet
[13:42:51] bottom left, show all
[13:42:58] or next
[13:43:11] we keep only 24h though
[13:43:34] catalogs broken is new to me
[13:43:42] it has always worked and I've used it many times
[13:44:20] jbond42: I'm wondering if we changed anything recently in the env that might cause that
[13:44:25] given that we didn't touch puppetboard
[13:44:40] unless it's related to some CAS-related work
[13:44:53] not sure, just looking over it, i see a bunch of warnings in the logs
[13:44:54] unable to add HTTP_X_CAS_AUTHENTICATIONMETHOD=LdapAuthenticationHandler,U2FAuthenticationHandler to uwsgi packet, consider increasing buffer size
[13:46:05] which leads me to https://stackoverflow.com/questions/32637400/uwsgi-seems-to-break-when-queries-sent-to-it-are-too-long so I'm wondering if the large query and other data in the environment is causing the request buffer to get too large (but still looking)
[13:47:04] jbond42: this is a better writeup https://www.georgeyk.com.br/blog/uwsgi-nginx-error/
[13:47:20] the query parameters sent with the request are very large
[13:47:21] thx, looking
[13:47:28] I suspect we were just on the edge before
[13:48:02] yes, different variables show up in the log, mostly HTTP_CONNECTION=Keep-Alive, but also others
[13:51:25] cdanis: is that working for you now?
[13:53:19] jbond42: 👍 yep, thanks
[13:53:34] cool will send a puppet patch shortly
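For context, a sketch of the class of fix implied by that patch: uwsgi rejects requests whose headers plus query string overflow its buffer (4096 bytes by default), and the proxy in front surfaces that as a 502. buffer-size is a real uwsgi option; the unit name, ini path, and the value 8192 here are assumptions:

    # Confirm the symptom in the uwsgi logs (unit name is hypothetical).
    journalctl -u uwsgi-puppetboard | grep 'consider increasing buffer size'

    # Raise the request buffer in the uwsgi ini; in production this would
    # land via the puppet patch rather than a hand edit.
    printf 'buffer-size = 8192\n' >> /etc/uwsgi/apps-enabled/puppetboard.ini
    systemctl restart uwsgi-puppetboard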
[13:54:36] <_joe_> cdanis: so one thing I don't like about using puppetdb to generate conftool data is that stuff like what akosiaris is doing (migrating one lvs service from iron to k8s)
[13:54:43] <_joe_> would become impossible
[13:55:55] k8s is iron as well :P
[13:56:39] <_joe_> akosiaris: heathen!
[13:56:49] <_joe_> you're so not cloud native
[13:57:14] dude, have you been to the clouds? It's so damn cloudy up there.
[13:57:22] and cold as well
[13:57:40] <_joe_> today I'd take cold in any form
[13:57:57] I'll skip the cloud part, talk to me when you've reached space
[21:20:39] I've got the first iteration of the incident report for today's WDQS outage(s) done - is there a certain google drive folder or something I need to stick the doc in?
[21:23:43] ryankemper: https://drive.google.com/drive/folders/1NaGMm8KzCWkVtQm4NbnS49ifbbbTO9bM is fine for your WIP, normally they live on wikitech unless they're sensitive
[21:23:50] ryankemper: the two usual places are https://wikitech.wikimedia.org/wiki/Incident_documentation (for documents that can be made public, which I suspect this one is, modulo redacting the blocked IP address), and-- the one that rzl just said
[21:31:28] Thanks, I've moved the google doc (https://docs.google.com/document/d/18uf243lCXXZzOmYKg1VGG9KWLdWt-CK5uc3Ft-kkMzk/) to that folder. Will create the (IP-address-redacted) wikitech page in a bit
[21:33:37] thanks for the writeup!
[21:34:17] thanks for all the help
[22:22:45] mutante: are you at all familiar with partman vs. lvm vs. lsblk? The final result of this (the actual mount points and space) looks right to me, but… it seems like there's at least one extra layer of wrapping
[22:22:57] https://www.irccloud.com/pastebin/09b5o1z2/
[22:23:12] That should be a hw raid1 of two 240GB drives
[22:23:28] I guess I don't understand why sda1 and sda2, unless that's an lvm thing
[22:24:37] (that question is open for anyone who wants to tell me that it looks fine and normal so I can get on with things)
[22:27:05] um… bstorm if you're here, ^ is about one of the new osd nodes
[22:29:44] huh. I have to admit that display looks slightly surprising to me
[22:29:52] andrewbogott: i don't know a lot about it but the partman recipe has "1dev" in the name and there is a comment saying /dev/sda is the hardware raid device
[22:30:13] i guess sda1 is boot and sda2 the rest?
[22:30:41] Yeah, it seems right now that I realize that sda1 is 285M not G
[22:30:47] 285M though? The drive is only 240G
[22:30:47] My eyes saw G at first for no reason
[22:30:51] Oh
[22:30:53] oh!
[22:31:01] that makes a lot more sense
[22:31:16] so in that case maybe this is all fine
[22:31:18] yeah, /boot wouldn't have lvm
[22:31:37] 9 d-i grub-installer/bootdev string /dev/sda
[22:31:45] yea, i guess it's normal then
[22:31:53] I think it was just a visual quirk that caught me on a glance as well :)
[22:32:23] so here's the full output:
[22:32:25] https://www.irccloud.com/pastebin/Ioi7COIK/
[22:32:40] Looks to me like a system disk, and then a bunch of untouched/unpartitioned disks
[22:32:55] compare how it looks on an existing host using that recipe; "sodium" or snapshot1005, for example
[22:33:15] so that seems like what we want? On cloudcephosd1003 each of those bigger disks has lvm but I'm hoping that's a thing the ceph puppet class does
[22:33:23] That sounds like an OSD. They are supposed to be as close to JBOD as possible for the rest of the drives
[22:33:44] yeah, ok, looks the same as sodium
[22:34:00] thanks all! I'm going to try to get the other osd nodes to look like this one :)
[22:34:04] 👍🏻
[22:34:13] Thanks, andrewbogott!
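For reference, a sketch of the layout being described: a hardware-RAID system disk with a small /boot partition and an LVM physical volume for everything else, and the OSD data disks left unpartitioned until ceph claims them. The names and sizes below are illustrative, not taken from the actual pastebins:

    # System disk: ~240 GB hw raid1 presented as a single device.
    lsblk -o NAME,SIZE,TYPE,MOUNTPOINT /dev/sda
    # NAME              SIZE TYPE MOUNTPOINT
    # sda             223.5G disk
    # ├─sda1            285M part /boot        <- megabytes, not gigabytes
    # └─sda2          223.2G part              <- LVM physical volume
    #   └─vg0-root    223.2G lvm  /            <- VG/LV names are assumptions

    # A data disk: no partitions at all until the ceph tooling takes it.
    lsblk -o NAME,SIZE,TYPE /dev/sdb
    # NAME   SIZE TYPE
    # sdb    1.8T disk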