[15:19:51] o/
[16:03:30] volans: is there anywhere someone is keeping track of ideas for what would go in a hypothetical wmflib module?
[16:04:26] * volans opens the third drawer from the top in the 6th column on the left side of his skull
[16:04:47] cdanis: shoot so I can add a note to the drawer :-P
[16:04:56] volans: timestamps in logging output
[16:05:36] that was actually a choice, as they are in the log files and seemed a bit boring in the output
[16:05:48] but I totally agree that for some more long-lasting cookbooks they are useful
[16:06:11] yeah, should be optional / easy to enable
[16:06:55] that will be slightly harder because the logging is set up by spicerack, but with the class-API I should get back to, it should be easy to add and have spicerack inspect it
[17:29:47] <_joe_> --timestamps on the cookbook cli ?
[17:30:24] I'd say it's better for the cookbook to define it in the code, because some will most likely need it all the time and others never
[17:30:31] but open for suggestions
[18:02:09] actually, if the cookbook declares itself to be long-lived, maybe that controls timestamps, warnings if you're not in screen/tmux, and who knows what else
[18:02:29] short-lived cookbooks don't do any of that stuff, long-lived cookbooks do all of it
[18:02:38] that seems like a good way of doing it.
[18:02:44] rlazarus: +1
[18:02:59] ok for me, but I had complaints about having too much magic sometimes ;)
[18:03:29] it just has to be the right kind of magic
[18:03:54] features that do what I want are convenient, features that do what I don't want are too magical
[18:04:07] yeah, rzl gets it
[18:04:16] rzl?
who is rzl
[18:04:22] I am Guy Incognito
[18:04:59] 🥸
[18:05:24] oh wow I don't have that character yet but I WANT it
[18:06:53] second half of 2020, supposedly
[18:07:23] even more exciting, rlazarus, is that Emoji 13.0 includes 🤌 U+1F90C PINCHED FINGERS
[18:07:42] this will revolutionize SRE communication at WMF
[18:08:32] a way of thinking of the feature balance (not that it necessarily solves anything) is that some things are Products and some things are Tools.
[18:09:05] Tools should have lots of small arcane optionality and maximal flexibility (and copious documentation, hopefully)
[18:09:19] Products should have an easy surface that meets specific use-case needs and is intuitive
[18:09:31] trying to make one thing satisfy both often proves frustrating
[18:10:08] a long-lived flag that controls many internal settings is more product-like, and a --timestamps option is more tool like.
[18:10:29] (not that we can't have both, just it makes life harder when you walk a balance)
[18:11:15] we have more layers here though... spicerack API as a library for cookbooks, global cookbooks parameters, cookbooks code that decides what to do and cookbooks specific parameters :)
[18:11:54] :)
[18:12:29] git is a good solid Tool example
[18:13:08] gerrit might be one of those difficult inbetween cases
[18:13:19] I think that the only "product" like thing in this example are the final cookbooks
[18:13:28] yeah, probably
[18:13:53] although even with APIs, there's some kind of tool-product dimension, maybe those are just the wrong labels for it
[18:14:21] flexible vs intuitive? although some flexible things are intuitive.
[18:14:44] tools enable creative processes, and products just let you get things done.
[18:16:40] or maybe another angle: products are the leaf nodes, and tools are the branches leading to them, or that they build on. and it can be challenging for a thing to be both an effective branch and an effective leaf or fruit.
[18:17:23] not a very good metaphor, the more I stare at it :)
[21:53:59] * cdanis looking at icinga metamonitoring alert
[21:55:25] so, this alert is not-paging for the metamonitoring rules
[21:55:33] but because it now has the VO contact
[21:55:36] it pages through it
[21:55:55] at first glance it seems more a problem with wikitech-static than icinga2001
[21:56:16] ehm
[21:56:19] https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-target_site=codfw&var-ip_version=All&var-country_code=All&var-asn=All&from=now-6h&to=now
[21:56:27] that looks bad
[21:56:58] I can't ssh into wikitech-static from icinga2001 while I can from 1001
[21:57:13] connectivity issues?
[21:57:45] yes
[21:57:51] getting mtrs now and considering depooling codfw
[21:58:05] yeah me too, running mtrs
[21:58:17] I have 15% packet loss on icinga2001 via telia
[21:59:00] no obvious traffic drop on the frontend traffic console
[21:59:14] interestingly enough I have 0 packet loss from my bouncer
[21:59:16] but ripe atlas agrees that codfw is having issues
[21:59:27] through telia too
[21:59:36] it's not just telia
[22:00:23] hi
[22:00:52] https://phabricator.wikimedia.org/P10968
[22:02:35] checking
[22:02:41] did you check the return path?
[22:02:55] return path from bast2002 is via telia
[22:02:58] also in paste
[22:04:12] ah yeah
[22:04:17] v4 is the same?
[22:04:18] TODO: make the alert for the passive icinga not-paging in VO
[22:04:22] getting v4 now
[22:04:55] paste updated with v4
[22:04:59] looks broken on rackspace's side?
[22:05:56] which would explain why we don't see a noticeable user traffic drop, and why https://atlas.ripe.net/measurements/24711321/#!probes doesn't look like widespread issues
[22:06:31] cdanis: how would that explain my mtr locally?
[22:06:45] volans: you were seeing 15% pkt loss end-to-end? not just on a middle hop?
[22:07:05] cdanis: 15 on the final one, in the middle up to 90% fwiw
[22:07:11] but you know middle hops are not reliable
[22:07:14] yes
[22:07:14] volans: can you share it?
[22:07:42] nope, lost the backlog sorry
[22:07:46] the downwards slope on the frontend traffic graph for codfw has just gotten worse, it's under 9.5k rps now, i'm going to depool out of paranoia
[22:08:01] now it's working, the path seems similar, let me see
[22:08:31] ehm I hope these mgmt changes are okay to merge as well
[22:09:21] +1 on depool
[22:09:22] XioNoX: appended at the end of https://phabricator.wikimedia.org/P10968
[22:09:29] 1~2% now
[22:09:30] checking looking glass
[22:09:33] but was 155 before the last line
[22:09:36] *15%
[22:11:09] icinga2001 is unable to ping wikitech-static
[22:11:48] ripe atlas data suggests reachability has recovered, but rtt still slightly elevated
[22:11:54] mtr completes now
[22:12:06] and the curl to icinga2001 from wt-static completes
[22:12:13] cdanis: but not the return path
[22:12:52] works on v4
[22:12:55] yeah
[22:12:56] v6 seems broken
[22:13:05] return path has worked on v4 whole time, just not the forward path on v4
[22:13:55] I think it's working again
[22:14:03] v6 works now
[22:14:15] ripe atlas shows reachability and rtts back to normal
[22:15:02] I tried to ping it from both zayo and telia's interfaces in codfw and they started working at the same time
[22:15:33] weird
[22:15:52] from telia's looking glass, this was the previous preferred path:
[22:15:58] https://www.irccloud.com/pastebin/MOMLk4Iy/
[22:16:06] and now it's:
[22:16:09] https://www.irccloud.com/pastebin/lGgtlmMZ/
[22:16:45] so maybe the 1299/12200 peering point was having issues?
[22:17:05] seems likely
[22:17:34] or I guess more if the ripe atlas saw it
[22:17:58] it wasn't huge on ripe atlas, but, it was noticeable
[22:18:28] maybe one of their peering routers, impacting a bunch of paths
[22:18:28] below 99.2% of ipv4 reachability is unusual, and we got to 98.2%
[22:19:19] we could potentially peer with them in Equinix ashburn/chicago/dallas
[22:19:32] https://www.peeringdb.com/net/22
[22:20:21] for external monitoring purposes isn't it better not to?
[22:21:00] unless we do proper external monitoring ofc with multiple edges, etc...
[22:21:37] volans: yeah, I wouldn't call that external monitoring :)
[22:21:52] the ripe stuff is useful though
[22:26:24] I'm going back to sleep
[22:26:32] * volans too
[22:33:45] +1 to everything A.r.z.h.i.o.N.o.X. said
[22:33:58] I will check again and repool codfw in a bit
[23:06:14] AzhioNox sounds like a noisecore band