[07:27:54] marostegui: I believe yours and _joe_'s issue is the same as https://phabricator.wikimedia.org/T236114
[07:28:29] paladox: I didn't know about that ticket. Thanks, I reopened https://phabricator.wikimedia.org/T234533
[07:28:35] as that was the closest one I knew about for this thing :)
[07:28:39] Thanks!
[07:28:54] Should I report there?
[07:29:44] You can, but I can tell Tyler since he knows the fix.
[07:33:19] Just commented - thanks :)
[07:44:06] Ok :)
[10:38:28] I've logged it to SAL, but just to be clear, dbtree and tendril will be down during the PDU maintenance
[11:59:55] sorry for the basic question, I read https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM
[12:00:13] but I am not sure what the proper options for wmf-auto-reimage are
[12:00:20] --no-pxe would be enough?
[12:00:57] I've not installed a VM since wmf-auto-reimage was created
[12:01:36] or should I just do it manually?
[12:02:29] for dbmonitor you can simply run "sudo -E wmf-auto-reimage dbmonitor2001.wikimedia.org"
[12:02:59] but it asks me for a mgmt domain
[12:03:22] yeah, that's because it can't figure out the mgmt for a wikimedia.org server currently
[12:03:43] sure, but it doesn't exist, it is a VM
[12:04:11] should I just put something random?
[12:04:13] ah, sure, didn't think about it being a VM
[12:04:38] I don't believe it currently supports Ganeti VMs
[12:04:42] ok
[12:04:46] so manual it is, np
[12:04:49] thanks!
[12:04:59] yeah, https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM
[12:07:10] moritzm: did gerrit show an error when you pressed save?
[12:07:11] * paladox thinks this is a perms error
[12:07:46] no, I'm not getting an error when clicking Save
[12:08:44] paladox: Hmmh, actually I've just tried with another patch and it worked there
[12:09:25] but it failed for https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545254/ (but can't use that one to repro as now merged)
[12:09:40] Ok
[12:09:41] Yeah, worked for me last night
[12:09:42] I think we just need to chown the repo to gerrit2
[12:10:03] Oh... that's a new change (i.e. after the migration)
[12:10:38] yeah, that's probably the difference, the patch which worked is an older one of mine, while the failing one was created after the gerrit1001 migration
[12:14:37] * paladox tries
[12:16:59] moritzm: worked for me https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/545262/
[12:17:31] I'll retry with the next patch I create
[12:17:41] Thanks!
[12:18:11] Was this with the PolyGerrit UI or GWTUI?
[12:29:21] not sure, whatever is the classic one
[12:31:57] GWTUI
[12:32:11] then that one :-)
[14:09:56] moritzm: I think it was a one-time issue (gerrit would have reported an error and logged it if it had failed)
[14:10:03] Has it worked for you since?
[14:13:05] in a meeting, will check later
[14:13:29] Ok
[14:16:58] XioNoX: we see some network issues in CloudVPS that may be related to cr1-eqiad. Packet loss
[14:17:29] arturo: the cr1 alert was related to just a transit port, for external connectivity
[14:17:32] and has since been resolved
[14:17:50] For example, traffic between bast1002 and ldap-ro.eqiad.wikimedia.org (which is an LVS service IP)
[14:17:52] packet loss to ldap-ro.eqiad.wikimedia.org from CloudVPS and bast1002.wikimedia.org
[14:18:03] https://www.irccloud.com/pastebin/vi5elLvy/
[14:19:16] Arzhel is probably not working yet, pretty early in CA
[14:19:57] I'm not sure who's next in line for in-dc networking things
[14:19:59] Arzhel is in the middle of working on the esams refresh :)
[14:20:12] oh, he's in NL?
[14:20:18] then he should be awake at least :)
[14:23:48] https://www.irccloud.com/pastebin/6UK580qW/
[14:23:59] ^ taking wmcs stuff out of the equation, still looks pretty bad
[14:25:15] It's only 1014 that has the loss; 1013, 1015 and 1016 look fine to me
[14:26:27] akosiaris or effie, could I get some LVS help? we're having a partial outage
[14:27:28] 1014 what?
[14:27:35] lvs1014
[14:27:44] bblack: see my mtr paste just a few lines up?
[14:28:18] XioNoX: ^ someone's observing packet loss to lvs1014 in eqiad
[14:30:03] bblack: could it be the host struggling with the increase of traffic to eqiad?
[14:30:26] I don't see any internal link saturating
[14:30:43] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=11&fullscreen&orgId=1&var-server=lvs1014&var-datasource=eqiad%20prometheus%2Fops&var-cluster=lvs
[14:30:50] XioNoX: I've just observed 30% packet loss between cp1075 and lvs1014
[14:31:02] lvs1014 has been spewing errors for an hour now
[14:31:14] 300kpps (!)
[14:31:28] interesting
[14:31:49] https://grafana.wikimedia.org/d/000000377/host-overview?refresh=5m&panelId=8&fullscreen&orgId=1&var-server=lvs1014&var-datasource=eqiad%20prometheus%2Fops&var-cluster=lvs
[14:31:54] ksoftirqd is maxing out one core
[14:31:57] it correlates exactly with tx/bytes flatlining there
[14:32:11] ok that makes some sense ema
[14:33:07] ema: is it masked just to that core in the scheduler?
[14:33:09] note that repooling esams is still an option and I'll do my maintenance at another time
[14:33:44] do we want to maybe stop pybal on lvs1014 and let the secondary kick in?
[14:34:07] it will just move the problem if it's real traffic
[14:34:23] do we have a really crazy single client with heavy traffic?
[14:34:53] 1014 is where maps traffic goes
[14:34:55] is it that?
[14:36:08] bblack: there is something wrong with the upgrade, I'll postpone it anyway, preparing a patch to repool the site
[14:36:26] heh I just started pushing the geodns stuff
[14:36:43] ok
[14:37:08] XioNoX: asw upgrade?
[14:37:13] yeah
[14:37:31] for some reason it worked on the non-prod devices but not on the prod ones
[14:37:50] let's hold a few minutes before unwinding all the depool stuff
[14:38:04] may as well observe the effects of the geodns reshuffling so we have some confidence in what it does when we do it again
[14:38:08] since it's already out there
[14:39:57] bblack: https://gerrit.wikimedia.org/r/c/operations/dns/+/545298 let me know when to merge it
[14:40:10] XioNoX: it will be ~15 mins
[14:41:24] bblack: varnishtop on upload shows normal thumbs requests, not maps
[14:43:16] ✔️ cdanis@lvs1014.eqiad.wmnet ~ 🕥☕ taskset -p 3
[14:43:19] pid 3's current affinity mask: 1
[14:43:22] should that be so?
[14:43:47] yes
[14:43:50] ok
[14:43:54] (probably)
[14:44:25] there's some tuning on these that hashes traffic (by source flow info) onto several hardware IRQs and softirqs, and pins the kernel handling to specific CPUs, etc.
[14:44:28] ok it's the same on other lvses
[14:44:32] which is why if it's one core, it's probably an isolated client IP
[14:44:51] sending an unholy amount of traffic or something?
[14:45:00] (or receiving, as the case may be)
[14:46:06] it does smell fishy that it happened to be CPU0 though
[14:46:37] regardless, what's the source of the pps issue?
[14:46:48] if it's going to be a bit, can/should we move our ldap traffic to a different LVS? Right now none of our users can SSH anywhere.
[14:54:35] bblack: that seems to have fixed things for wmcs, thanks!
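A rough sketch of the reimage discussion around 12:00-12:05, using only the invocation quoted in the log plus stock Ganeti; the WMF-specific manual steps live on the wikitech page linked above, and anything beyond the quoted command is an illustrative assumption:

    # Physical host: the invocation quoted in the log (run with the usual sudo -E).
    sudo -E wmf-auto-reimage dbmonitor2001.wikimedia.org

    # Ganeti VM: wmf-auto-reimage expects a mgmt interface that VMs don't have, so
    # the reinstall is manual. Stock Ganeti has a reinstall command (run on the
    # Ganeti master); the exact WMF workflow is the one on the wikitech page.
    sudo gnt-instance reinstall dbmonitor2001.wikimedia.org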
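For the Gerrit save failures around 12:07-12:10, the suspected fix in the log is re-owning the on-disk repository to the gerrit2 user after the gerrit1001 migration. A minimal sketch, assuming a hypothetical repository path (the real git data directory on the Gerrit host may differ):

    # Assumption: /srv/gerrit/git/... is illustrative; substitute the host's actual git dir.
    sudo chown -R gerrit2:gerrit2 /srv/gerrit/git/operations/puppet.git
    # Verify nothing under the repo is still owned by another user.
    sudo find /srv/gerrit/git/operations/puppet.git ! -user gerrit2 -ls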
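The pastebins around 14:18 and 14:23 are mtr runs; something along these lines reproduces the measurement from bast1002 or a CloudVPS instance (report mode with a fixed probe count, so the loss percentage is comparable between runs):

    # Non-interactive report; ~100 probes per hop gives a usable loss figure.
    mtr --report --report-cycles 100 ldap-ro.eqiad.wikimedia.org

    # Quick one-off loss check against the LVS service IP as well.
    ping -c 100 -i 0.2 ldap-ro.eqiad.wikimedia.org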
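To confirm the "ksoftirqd is maxing out one core" observation from around 14:31 on lvs1014, the usual checks with standard tools look roughly like this (nothing here is WMF-specific):

    # Which CPU is burning time in softirq context (watch the %soft column).
    mpstat -P ALL 1 5

    # Per-CPU softirq counters; NET_RX climbing on a single CPU matches the symptom.
    watch -d -n1 'cat /proc/softirqs'

    # CPU usage of the ksoftirqd threads themselves.
    pidstat -p $(pgrep -d, ksoftirqd) 1 5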
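The taskset exchange at 14:43 checks that pid 3 (most likely ksoftirqd/0) is pinned to CPU0, which is expected given the tuning described at 14:44. A sketch of how that RSS/IRQ pinning can be inspected from /proc, assuming eth0 as the NIC name (substitute the real interface):

    # Confirm what pid 3 actually is before reasoning about its affinity.
    ps -o pid,comm -p 3

    # Per-queue IRQ affinity for the NIC (eth0 is an assumption).
    grep eth0 /proc/interrupts
    for irq in $(awk '/eth0/ {sub(":","",$1); print $1}' /proc/interrupts); do
        echo "IRQ $irq -> CPUs $(cat /proc/irq/$irq/smp_affinity_list)"
    done

    # How many hardware RX queues the flow hashing is spread across.
    ethtool -l eth0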
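On the "stop pybal on lvs1014 and let the secondary kick in" idea at 14:33: pybal runs as a service on the LVS hosts, so the failover and a check of the kernel's IPVS state would look roughly like the sketch below; the actual depool procedure belongs to the runbooks, not this snippet.

    # On the primary (lvs1014): stop the balancer so the backup LVS takes over.
    sudo systemctl stop pybal.service

    # On the secondary: confirm it is now carrying the traffic.
    sudo ipvsadm -L -n --stats | head -50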
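And for the "really crazy single client" question at 14:34, a quick way to sample which source IPs dominate on the box itself, assuming eth0 again and plain tcpdump/awk (varnishtop on the cache hosts, as used at 14:41, is the other angle):

    # Sample 20k packets and count the top IPv4 source addresses.
    sudo tcpdump -n -i eth0 -c 20000 2>/dev/null \
      | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head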