[04:19:52] <_joe_> cdanis: it's not that odd
[04:20:29] <_joe_> jobrunners only run refreshlinks, and I'm pretty sure the refreshlinks job for that specific edit will be running for ~ a week
[04:21:01] <_joe_> htmlcacheupdate, which is the job invalidating the cache, has probably run in a few hours
[04:23:04] <_joe_> sorry, less than an hour
[09:43:30] heads up on https://phabricator.wikimedia.org/T250652
[09:43:58] not that it is highly impacting, but because it will show on icinga flapping up and down until corrected onsite
[09:44:57] ^CC cdanis who I think is on clinic duty this week FYI
[09:48:46] since msw1-eqiad is the "parent" of all these mgmt hosts in Icinga it should be possible to effectively downtime them all by downtiming the switch
[09:50:16] mutante: but then it would hide potential issues on other rows
[09:51:34] XioNoX: oh, just one row is affected? ok
[09:52:00] yep, A6
[09:54:00] is it a switch per row, or a switch per rack?
[09:55:15] let me be more specific, I saw only A6 rack servers complain, but no other A servers
[09:56:29] er, yeah there is one per rack, and they aggregate into a single core mgmt switch
[09:57:03] so I should have said "but then it would hide potential issues on other RACKS"
[09:57:31] cool, np, I thought my understanding of net topology was wrong :-D
[13:54:56] herron: godog
[13:54:59] ok if i merge
[13:55:01] https://gerrit.wikimedia.org/r/c/operations/puppet/+/589400/7
[13:55:02] ?
[13:57:10] ottomata: hey, sure LGTM
[14:03:13] godog: hm, maybe you are right... maybe the log messages from eventgate are enough. they don't contain the full raw event, but we will import the error topics into Hive, so folks can go deeper into errors if they have to.
[14:03:14] hm
[14:03:44] i'll still merge the refactor but punt on the new validation error logstash inputs for now
[14:04:53] ottomata: yeah I think it'll be good enough to start, if they aren't enough adding the error topics is easy enough
[14:09:08] ok puppet ran on logstash1007
[14:09:21] new truststore files all deployed fine, and it looks like logstash inputs were all 'refreshed'
[14:12:12] nice! yeah seems expected
[14:28:52] hiya moritzm
[14:29:00] i just built a package for stretch on deneb
[14:29:19] but it built into the buster-amd64 results directory
[14:29:38] the command I ran was
[14:29:39] GIT_PBUILDER_AUTOCONF=no DIST=stretch gbp buildpackage -sa -us -uc --git-builder=git-pbuilder
[14:29:48] and the changelog entry is
[14:29:49] ua-parser (0.10.0+core0.8.0~1-1) stretch-wikimedia; urgency=low
[14:31:01] i guess the .changes file has Distribution: stretch-wikimedia so it doesn't matter?
[14:39:15] just a heads-up, nic-saturation-exporter is now enabled on all physical hosts, no impact on anything expected but it is a new binary and a bunch of new prometheus metrics on every physical machine
[14:50:47] ottomata: in an interview, will ping you later
[14:51:39] ok ty, just emailed you details
[14:52:24] ottomata: what does the dist in the changelog file say?
[14:59:24] marostegui: hey, did you know that many of the s4 DBs have pretty hot NIC tx? db1081, db1103, db1091, and to a lesser degree db1084
[15:00:34] cdanis: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/python-ua-parser/+/debian/debian/changelog
[15:00:43] cdanis: hot?
[15:00:45] ottomata: weird :)
[15:00:58] marostegui: bursts of >=90% line rate
[15:01:27] cdanis: ah yes, yeah, we are aware. We are buying the new hosts with 10G NICs now just in case
[15:01:39] ok!
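The "hot NIC tx" and ">=90% line rate" figures above (and continuing below) come from comparing observed throughput against interface speed. As a rough sketch of what that comparison looks like on a single host, here is a spot check using the standard Linux /sys counters; the interface name eno1 is a hypothetical placeholder, and the actual monitoring here is done by the nic-saturation-exporter / Prometheus metrics mentioned above, not by an ad-hoc script like this:

    # Spot-check tx utilisation on one interface over a 10-second window.
    IFACE=eno1                                                # hypothetical interface name
    SPEED_MBITS=$(cat /sys/class/net/$IFACE/speed)            # link speed in Mbit/s (e.g. 1000, 10000)
    TX1=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
    sleep 10
    TX2=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
    RATE_MBITS=$(( (TX2 - TX1) * 8 / 10 / 1000000 ))          # bytes -> bits, per second, in Mbit
    echo "$IFACE: ${RATE_MBITS} Mbit/s of ${SPEED_MBITS} Mbit/s line rate"

Note the factor of 8 between bytes and bits; it is the same conversion behind the megabytes/second vs megabits/second dashboard question later in the log.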
[15:01:54] cdanis: all of those are s4 btw
[15:01:56] commons!
[15:02:02] yes, that's what I said :)
[15:02:10] ah sorry
[15:02:13] db1097 as well (also s4)
[15:02:31] yeah, we are aware of s4 with those issues :(
[15:02:39] well, I'm glad you're already aware
[15:02:47] The new s4 hosts do have 10G :)
[15:08:38] _joe_: most apiservers have temporary bursts of >90% line rate on rx
[15:43:06] interesting, most of the jobrunners just temporarily maxed out their rx https://w.wiki/NXf
[19:17:25] check me -- running enable-puppet does not immediately trigger a puppet run, right? so it's safe to run on 400+ machines without staggering them?
[19:20:02] fairly sure that's true
[19:20:10] just means it can be run
[19:26:01] rzl: that's my understanding
[19:28:51] cool, thanks
[19:37:33] anyone object to me changing the networking unit on the host overview dashboard (https://grafana.wikimedia.org/d/000000377/host-overview) from megabytes/second to megabits/second? the latter is a lot more natural when thinking about line rate
[19:42:06] relatedly there are some dashboards (which I can't find right now) that have some really odd units
[19:42:10] like MBs or something like that
[19:42:30] ah yeah, example: https://grafana.wikimedia.org/d/000000608/datacenter-overview?orgId=1
[19:42:46] and its sibling https://grafana.wikimedia.org/d/000000605/datacenter-global-overview?orgId=1
[20:39:39] rzl: from my experience --enable does indeed not trigger a run, it just tells puppet it is now able to run
[20:40:07] thanks! I didn't follow up here but that is indeed how it worked
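On the enable-puppet question from 19:17: as confirmed at the end of the log, enabling does not start a run by itself. A minimal sketch of why that makes a mass rollout safe, assuming the enable-puppet wrapper ultimately does little more than the stock agent command shown below (that is an assumption; the site-specific wrapper may add its own bookkeeping):

    # Enabling only removes the "administratively disabled" lockfile;
    # no catalog run is started, so this is safe to fan out to 400+ hosts at once.
    sudo puppet agent --enable

    # A run still only happens on the normal schedule (cron/systemd timer)
    # or when triggered explicitly, e.g.:
    sudo puppet agent --test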