[11:00:09] eehhhh wtf gitiles https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+blame/master/templates/cr.conf
[11:00:24] lines 61-63 link to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+/caf7b4f0ba51a9c28d2ea2ea4518ad8dad56e92e which is 404?
[11:00:46] other lines work fine
[11:01:10] searching for that short id in gerrit gives no results
[11:01:13] (caf7b4f)
[11:02:47] cdanis: https://github.com/wikimedia/operations-homer-public/commit/caf7b4f0ba51a9c28d2ea2ea4518ad8dad56e92e
[11:02:55] 9 months ago
[11:03:10] so why not in gerrit/gitiles??
[11:04:13] cdanis: so, we had as a requirement to keep the history
[11:04:31] of the private repo that para.void had used in his tool that was the precursor of homer
[11:04:34] right
[11:04:40] a big effort was made in cleaning up that repo
[11:05:00] and that history was pushed without creating a CR for each old commit
[11:05:05] it would only have been spammy
[11:05:18] the repo had been checked and audited by multiple people before pushing it publicly
[11:05:33] dunno why gitiles shows it that way though
[11:05:39] hm
[11:07:40] it's in the git repo, ok not in gerrit, but still...
[11:08:16] gitiles must use some of the same indexes as gerrit
[11:09:17] it's listed here
[11:09:18] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/homer/public/+log/HEAD/?s=1d3f461eeb7ed260bdafd5a0597d8b3c1a45e453
[11:09:22] second from the top
[11:09:38] a *lot* of those are 404s :D
[11:09:42] I'm wondering if we should trigger some sort of re-index job in gerrit
[11:10:17] my bad for not having checked gitiles, good catch :)
[11:24:13] herron: helloo!! qq - I have checked in spicerack/cookbooks the work that gehe*l did to roll restart ES etc.. (really great!) and I am wondering if we re-use those cookbooks elsewhere (and if not, what can we do to expand the codebase?)
[11:24:40] elukey: <3
[11:26:52] kafka/mirror-maker cookbooks should also be usable for logging
[11:28:57] (let me know if you guys don't use them, we can improve them if anything is missing)
[14:37:43] elukey: hey! hmm I haven't tried it yet for the logstash ES nodes since they aren't too bad to roll through at just 3 data nodes per site. but the cluster size is growing and I totally should next time
[14:40:23] herron: o/ - I think that we might need to create a specific cookbook for non-search use cases, but even with 3 nodes it makes sense (so we can start testing it etc..)
[14:40:31] also, have you used the kafka/zookeeper ones?
[14:40:49] yes I did some time ago for kafka
[14:41:00] super, if anything is missing let's chat about improvements
[14:41:07] really open to suggestions
[14:41:48] awesome, will keep that in mind and reach out if something comes to mind
[14:43:19] thanks!
[14:44:40] herron: maybe let's open a task for the ES roll restart cookbook involving ge*hel too?
[14:45:35] ok sure
[14:45:51] <3
[14:47:08] and we have a “logstash-next” cluster in pre-prod state for a few weeks that we can experiment on without affecting actual prod, so the timing is actually quite good
[14:48:43] yep, sounds really nice as a testing env
[14:48:59] the best part is that we can blame Riccardo when spicerack doesn't work
[14:49:23] I kept writing cookbooks for this precise reason but didn't find any way to blame him
[14:49:25] * volans at your service :)
[14:51:04] lol
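For context on reusing those cookbooks, here is a minimal sketch of the module interface that spicerack's cookbook runner expects (an argument_parser() function plus run(args, spicerack)), with a toy roll-restart body. The Cumin alias, the service name and the batching knob are placeholders for illustration, not the actual sre.elasticsearch cookbooks:

```python
"""Example roll-restart cookbook skeleton (illustrative sketch only)."""
import argparse

__title__ = 'Roll-restart a service on a set of hosts, a few at a time'


def argument_parser():
    """CLI arguments for the cookbook."""
    parser = argparse.ArgumentParser(description=__title__)
    parser.add_argument('query', help="Cumin query for the target hosts, e.g. 'A:logstash-es'")
    parser.add_argument('--batch', type=int, default=1, help='hosts to restart per batch')
    return parser


def run(args, spicerack):
    """Restart the service on the matched hosts in small batches."""
    hosts = spicerack.remote().query(args.query)
    # A real cookbook would also set Icinga downtime and wait for cluster
    # health between batches; this only shows the overall shape.
    hosts.run_sync('systemctl restart elasticsearch', batch_size=args.batch)
    return 0
```

It would then be invoked through the usual cookbook entry point, something like `cookbook sre.example.roll-restart 'A:logstash-es' --batch 1` (name and alias hypothetical).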
[14:58:08] hey folks -- Monday is a WMF/US Holiday, and we're skipping the SRE meeting. I've also decided to skip the async notes for that week, as I have doubts about the effort/value tradeoff at this point.
[14:58:35] let me know if there are urgent things that can't wait
[14:59:26] \o/ ! <3
[14:59:29] <3
[15:07:15] :)
[15:15:00] <_joe_> oh noes
[16:25:44] just a shy idea, but maybe we could do something more team-centric (group updates per subteam, less context switching because of the large number of goals)
[16:26:08] although that wouldn't scale for cross-team goals
[16:26:37] I think maybe just at-risk goals at the big meeting
[16:27:03] I think, as we get into having more EMs eventually, that kind of structure will fall out more naturally
[16:27:16] (of rolling things up per-team)
[16:27:39] but things like db switchover would not fall there
[16:27:42] sorry
[16:27:44] dc
[16:28:09] i think so too, with more (managed!) sub teams we can roll up differently
[16:28:13] right now the system of OKRs and the kind of reporting we're doing in the meeting, it doesn't line up quite right when you have a subteam, because there's an additional layer of O->KR for the subteams with managers.
[16:28:23] yeah
[16:29:05] i think that too will also be better in the next FY with the new AP
[16:29:13] AP?
[16:29:16] annual plan
[16:29:19] ah!
[16:29:26] as we'll actually have annual plan OKRs
[16:30:12] it's helpful, when digging into the complexities here, to realize there are two different hierarchical dimensions in play in the OKR system (and related)
[16:30:37] one is the managerial tree (OKRs have to roll up hierarchically through the management tree)
[16:30:54] and the other is the timespan hierarchy (MTP -> 1yr AP -> quarterly)
[16:31:12] right now the two are blended together in one system that makes for some oddities...
[16:31:41] there are also several usages, "managerial" awareness - are things ok, and if not, what can we do about it?
[16:31:42] o/
[16:32:02] you can align those to some degree - with the top/C-level stuff being closer to the MTP end and ICs being closer to the quarterly end
[16:32:04] but also things like heads-up on blockers
[16:32:08] and upcoming work
[16:32:16] but there's only ~3 layers of timing hierarchy, and a lot more levels of managerial hierarchy
[16:32:18] and probably others I cannot think of right now
[16:32:52] and sometimes a low-level team (on the managerial hierarchy) might want year+ -scoped things, too
[16:33:02] it's all kinds of tricky to blend this all together sanely
[16:33:42] as others said, let's go through with the current status as best we can, and when managers are in place, things may happen naturally
[16:36:27] bblack: FWIW it isn't always the case that "OKRs have to roll up hierarchically through the management tree"
[16:36:38] that is an artifact of our particular implementation
[16:36:53] yeah I think while that often does make sense, that's not a strict requirement
[18:45:12] testing the decom cookbook on a ganeti VM :)
[18:46:04] mutante: thanks! I've done some testing already but it would be nice to have confirmations ;)
[18:46:51] i was just thinking maybe i could help the cookbook by telling it right away this is a VM and then it can skip asking for the mgmt password
[18:47:01] i realize it can only detect it afterwards
[18:47:13] volans: worked fine. "VM removed" exit_code=0
[18:47:33] check netbox too
[18:47:41] it should be at least in offline state or already removed
[18:47:53] and yes, I want to improve that part too
[18:48:02] status: offline
[18:48:04] we can skip the mgmt if virtual
[18:48:42] cool. it saves a few moments because i have to go into the pwstore, load the GPG key, etc.
[18:48:55] did not want to keep that in local keepass
[18:48:56] you could have put anything, just enter
[18:49:01] in this specific case
[18:49:05] ah, duh :)
[18:49:05] but I'll remove it
[18:49:06] ok
[18:50:36] also did "gnt-instance info " on the ganeti master node and it's already "instance ..not known". ack
[18:50:58] doing another in codfw
[18:51:18] yes it will be removed from netbox within a few minutes (don't recall if 10 or 30) automatically
[18:51:25] we could force a second sync
[18:51:29] but thought it was not worth it
[18:51:33] open for suggestions
[18:51:50] "within a few minutes" sounds just fine to me!
[18:53:38] volans: i tried just hitting enter. but fyi that is: raise SpicerackError('Empty Management Password')
[18:53:52] ah right
[18:53:54] put anything
[18:53:56] asda
[18:54:04] ACK, that works :)
[18:54:07] my ocd prevented me from using an empty password :D
[18:54:14] heh, *nod*
[18:54:33] I also want to improve that part because if you have to decom multiple physical hosts I think it asks for it for each one; I want to cache it
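A minimal sketch of what that could look like: prompt once, cache the value, and skip the prompt entirely for VMs. The helpers are hypothetical stand-ins (the real cookbook gets the virtual/physical bit from Netbox and raises SpicerackError on an empty password, as quoted above):

```python
# Illustrative sketch only: cache the management password across hosts and
# never ask for it when the host is a VM. is_virtual is a hypothetical flag
# standing in for whatever the decom cookbook reads from Netbox.
from getpass import getpass

_mgmt_password = None  # module-level cache: prompt at most once per run


def management_password():
    """Prompt for the management password the first time it is needed."""
    global _mgmt_password
    if _mgmt_password is None:
        _mgmt_password = getpass('Management password: ')
    return _mgmt_password


def decommission(host, is_virtual):
    """Decommission one host; VMs skip the mgmt/IPMI steps entirely."""
    if not is_virtual:
        password = management_password()
        # ... wipe and power off via the mgmt interface using `password` ...
    # ... common steps: set offline in Netbox, remove from DNS/Puppet, etc. ...
```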
[18:55:08] i just made a task yesterday that we need to decom at least 15 physical boxes all in the same rack and role :)
[18:55:22] ok, when will that be?
[18:55:39] Prio-High-soonish
[18:55:50] because it blocks racking all the new codfw mw servers
[18:56:01] well.. 15 of them
[18:56:32] but i am not even saying we necessarily want to decom them all in the same moment
[18:56:49] maybe we have to be a bit more careful when removing them and watch performance
[18:57:14] sure, but you probably want to do them in some batches anyway
[18:57:20] so while multiple servers is nice to have, it's not very important
[18:57:21] sure, yea
[18:57:47] so for now: https://gerrit.wikimedia.org/r/#/c/577646/
[18:58:43] cool, so it already knows is_virtual beforehand, nice +1
[18:59:42] pulled from netbox i guess?
[19:00:02] the first one is also gone from netbox now. second one status offline. works.
[19:00:31] cdanis: yes
[19:01:15] mutante: yes, the syncs of the various Ganeti clusters are splayed a bit, so each Ganeti cluster will be synced with a few minutes' difference between each other
[19:02:41] yep, makes sense
[19:11:34] mutante: if possible ping me before starting that round of decom, it would be even nicer if we could add the call to this custom script (and subsequent CR)
[19:11:37] https://gerrit.wikimedia.org/r/c/operations/software/netbox-extras/+/576461
[19:13:03] volans: ok, will do
[19:13:28] * volans bbiab
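On the "15 physical boxes all in the same rack and role" round mentioned above: a hedged sketch of building that host list from Netbox and walking it in small batches. The pynetbox filter names are from memory, and the URL, token, rack/role slugs and batch size are placeholders; double-check them against the Netbox API before relying on this:

```python
# Sketch: list the hosts sharing a rack and role in Netbox, then decom them
# a few at a time, pausing to watch capacity/latency between batches.
import pynetbox

nb = pynetbox.api('https://netbox.example.org', token='REDACTED')  # placeholders
devices = nb.dcim.devices.filter(rack='b3', role='mediawiki-appserver')  # assumed filter names
hosts = sorted(d.name for d in devices)

BATCH = 5
for i in range(0, len(hosts), BATCH):
    batch = hosts[i:i + BATCH]
    print('next decom batch:', ', '.join(batch))
    # run the decom cookbook for this batch, check the dashboards, then continue
```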
[22:33:24] "Resource type not found: Wmflib::Service at /etc/puppet/modules/wmflib/functions/service/fetch.pp:4:51"
[22:33:29] anyone know how that can be possible?
[22:36:37] it's pretty clearly defined in modules/wmflib/types/service.pp
[22:37:25] puppetmaster in this case is on 4.8.2 as it's still a stretch machine
[23:17:40] comparing `curl --cert /var/lib/puppet/ssl/certs/deployment-snapshot01.deployment-prep.eqiad.wmflabs.pem --key /var/lib/puppet/ssl/private_keys/deployment-snapshot01.deployment-prep.eqiad.wmflabs.pem 'https://deployment-dumps-puppetmaster02.deployment-prep.eqiad.wmflabs:8140/puppet/v3/catalog/deployment-snapshot01.deployment-prep.eqiad.wmflabs?environment=production' | jq .` to the same against deployment-puppetmaster04 makes it look like it's either going to be a version difference (puppetmaster04 runs buster with puppet 5.5.10, dumps-puppetmaster02 runs stretch with puppet 4.8.2) or something fixed in our cherry-picks on puppetmaster04
[23:28:00] * Krenair shrug
[23:28:06] works with the new one, oh well
[23:54:33] time to check again when to use webproxy.eqiad.wmnet and when to use url-downloader.wikimedia.org. both are http_proxies
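On the proxy question, a tiny sketch of pointing a client at one of them explicitly; which proxy applies and the port are assumptions here, so substitute whatever the host's actual policy says:

```python
# Sketch: send an outbound request through an explicit HTTP(S) proxy.
# The proxy host/port below are placeholders; pick webproxy.eqiad.wmnet or
# url-downloader.wikimedia.org (and the right port) per the local policy.
import requests

proxies = {
    'http': 'http://webproxy.eqiad.wmnet:8080',
    'https': 'http://webproxy.eqiad.wmnet:8080',
}
resp = requests.get('https://example.org/some/file', proxies=proxies, timeout=10)
resp.raise_for_status()
print(len(resp.content), 'bytes fetched via the proxy')
```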