[00:09:10] Traffic, DC-Ops, Operations, ops-eqsin: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3287561 (faidon) @RobH, this can be resolved now, right?
[00:09:43] Traffic, DC-Ops, Operations, ops-eqsin: singapore caching center: eqiad staging tracking task - https://phabricator.wikimedia.org/T166179#3974242 (RobH) stalled>Resolved Yep, was just cleaning up the sub-tasks. Nothing left for this!
[00:23:23] Traffic, Operations, ops-eqsin: dns5002 mgmt console unreachable - https://phabricator.wikimedia.org/T186902#3974268 (BBlack) Sounds about right to me. But let's do the other two in T187158 and T187157 as well and maybe get more value out of the time. cp5006 and cp5010 both have "working" managemen...
[07:31:11] netops, Analytics-Kanban, Operations, monitoring, and 2 others: Pull netflow data in realtime from Kafka via Tranquillity/Spark - https://phabricator.wikimedia.org/T181036#3974720 (elukey)
[09:33:22] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3974843 (Vgutierrez) I just added myself to root and security aliases. Regarding the ops mailing lists, @Volans told me that there is a typo on my email ad...
[10:18:09] <_joe_> ema: around?
about etcd - there is a fundamental problem with watching on etcd and not using waitIndex
[10:19:22] <_joe_> If I watch an etcd key with wait=True but without any waitIndex it will watch for changes happening from *now*
[10:19:40] <_joe_> so it won't report to me changes that happened in the last N minutes
[10:20:26] so pybal could lose changes if for some reason the communication with etcd is lost
[10:20:31] <_joe_> no
[10:20:42] <_joe_> or well, yes, but not just in that case
[10:20:52] <_joe_> say something edits multiple keys in a directory
[10:20:54] <_joe_> in sequence
[10:21:07] <_joe_> we would see the first change, change config, then watch again
[10:21:19] <_joe_> in the meanwhile, changes might have happened on etcd
[10:21:38] <_joe_> it will be *significantly* better with etcd3 that has transactions
[10:23:20] <_joe_> I am thinking of adding an etcd3 driver and a cassandra driver to conftool, in case we decide we don't really want to use etcd anymore
[10:30:51] hmm conftool != confd?
[10:31:47] vgutierrez: indeed :)
[10:32:46] https://gerrit.wikimedia.org/g/operations/software/conftool/
[11:20:07] <_joe_> vgutierrez: conftool is the tool with which we write to etcd
[11:20:16] <_joe_> etcd is the data store
[11:21:06] yup, I was checking the code :)
[11:21:14] <_joe_> oh, please don't
[11:21:18] <_joe_> it's shameful :P
[11:22:27] <_joe_> the only excuse I have is the first few versions were written in a hurry, and the subsequent fixes as well
[12:09:18] _joe_: hey :)
[12:10:04] _joe_: we're watching single "keys" on etcd though
[12:10:48] <_joe_> ema: I'm going to lunch
[12:10:50] so what is the scenario in which we could miss the current status of the value of a given key?
[12:10:51] <_joe_> bbiab
[12:10:57] <_joe_> no, we watch directories
[12:11:04] ah, TIL :)
[12:13:23] so what could happen is: we lose changes happening between the moment a change happens and the moment we're done updating our configuration
[12:16:05] but still, once we're done updating the config, we watch again and we see the latest values in etcd
[12:16:26] which is what we need to know whether a service needs to be pooled or not
[12:17:27] I think?
[12:24:31] <_joe_> we don't see the latest value, we see the first changed value since waitIndex
[12:24:40] <_joe_> then we watch again, rinse, repeat
[12:24:55] <_joe_> that's how the API works, not our choice
[12:25:48] oh, I thought you'd get the latest value by issuing a request with wait=true, but no waitIndex
[12:26:32] <_joe_> no
[12:26:37] <_joe_> you wait for the next change
[12:27:02] <_joe_> so yes, it's basically useless without waitIndex
[12:27:31] fascinating
[12:29:17] thanks for the explanation!
[12:35:48] <_joe_> yw :)
[13:31:43] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3975345 (Dzahn) >>! In T187035#3974843, @Vgutierrez wrote: > I just added myself to root and security aliases. great! :) > typo on my email address and...
[13:35:34] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3975370 (Vgutierrez)
[15:40:52] nice, wmf-upgrade-varnish --hiera-merged is working fine (cp5001 just upgraded that way)
[15:41:59] nice :)
[15:42:55] \o/
[15:43:19] cumin based?
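(editor's note) The etcd exchange earlier in the log (wait=true with and without waitIndex) can be illustrated with a toy in-memory model. This is only a sketch, not etcd client code: the event tuples, key names, and function names are hypothetical, and the one property it borrows from the discussion is that a watch with waitIndex returns the first change at or after that index, while a watch without it only sees changes from "now" on.

```python
from typing import List, Optional, Tuple

Event = Tuple[int, str, str]  # (modifiedIndex, key, value); layout is illustrative


def first_event_from(history: List[Event], wait_index: int) -> Optional[Event]:
    """Return the first event whose index is >= wait_index.

    None stands in for "the real watch would block until something changes".
    """
    for event in history:
        if event[0] >= wait_index:
            return event
    return None


def observed_events(history: List[Event], use_wait_index: bool) -> List[Event]:
    """Replay a watcher that re-watches after handling each event.

    Assumption: every write in `history` has already landed by the time the
    watcher finishes handling the first event (the "edits in sequence" case).
    """
    seen: List[Event] = []
    event = first_event_from(history, 1)  # initial watch, placed before any write
    while event is not None:
        seen.append(event)
        if use_wait_index:
            next_index = event[0] + 1        # resume right after what we saw
        else:
            next_index = history[-1][0] + 1  # "from now": skips what was already written
        event = first_event_from(history, next_index)
    return seen


# Three keys edited in quick succession (hypothetical key names).
HISTORY: List[Event] = [
    (1, "/pools/a", "pooled"),
    (2, "/pools/b", "depooled"),
    (3, "/pools/c", "pooled"),
]

without_index = observed_events(HISTORY, use_wait_index=False)
with_index = observed_events(HISTORY, use_wait_index=True)
print(len(without_index), len(with_index))  # prints "1 3"
```

In this model the watcher without waitIndex handles only the first edit and then waits for a change that never comes, whereas resuming from the last seen index plus one delivers all three edits in order.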
[15:43:24] indeed
[15:43:27] nice
[15:58:19] <_joe_> mark: once we have the spinoff, that will work without vampirizing things from wmf-auto-reimage
[16:38:53] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Upgrade to Varnish 5 - https://phabricator.wikimedia.org/T168529#3976058 (ema)
[16:38:58] Traffic, Operations, Patch-For-Review, Performance-Team (Radar): Upgrade cache_upload to Varnish 5 - https://phabricator.wikimedia.org/T180433#3976056 (ema) Open>Resolved a:ema
[16:41:53] \o/
[16:48:49] Traffic, Operations, Page-Previews, RESTBase, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#3976131 (phuedx) >>! In T184534#3967535, @BBlack wrote: > Do we want to allow stale content in the UA's cache here, for up to 5 minutes past the ex...
[17:14:43] ema: idea to improve the script, temporarily disable the weekly restart cron to be sure it's not in the way of the upgrades
[17:15:25] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3976273 (Dzahn) @Vgutierrez re: Icinga command permissions. should be all done. the ultimate test is if you try a "schedule downtime" or "disable/enable no...
[17:28:41] bblack: I've noticed some more icinga spam due to scheduled varnish-be restarts, we could perhaps bump check_interval https://gerrit.wikimedia.org/r/411033
[17:29:32] volans: I've gotta leave soon, but please elaborate!
We can discuss it tomorrow
[17:30:01] ema: the other POV on these is we know from the child restart counts we've got a slight memory-usage-increase issue with v5
[17:30:19] ema: maybe that additional pressure is slowing down the rm/sync, and we need to just adjust fe mem% down slightly and then these stop spamming
[17:30:46] cya :)
[18:11:31] ema: add something to wmf-upgrade-varnish to protect it from having conflicts with /etc/cron.d/varnish-backend-restart running at the same time
[18:12:03] I know the probability is low, but you know... Murphy's... :)
[18:12:52] there's a common pattern here with e.g. puppet disables and all that
[18:13:20] should all our tooling that we use from cumin and/or cronjobs use some universal exclusivity lock
[18:13:23] ?
[18:15:06] e.g. for things like run-puppet-agent and wmf-upgrade-varnish and varnish-backend-restart and a bunch of other tools, there could be one global lock. And since tools can call other tools, it can inherit (e.g. wmf-upgrade-varnish might disable-puppet)
[18:15:07] touch /var/run/i-m-doing-something-on-this-host
[18:15:17] a lockfile and an env var for inheritance?
[18:16:24] if lockfile_exists and not $INHERIT_LOCKFILE; do fail; else do create_lock_and_set_env_var; fi
[18:16:27] or whatever
[18:17:27] maybe make a python library we can integrate in all the things to handle it properly and race-free, with a standard commandline flag for whether you wait or immediately-fail if you encounter an existing lock you didn't inherit.
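(editor's note) The lockfile-plus-env-var idea above could look roughly like the following. This is a minimal sketch, not an existing library: `host_lock`, `LockHeld`, and the `HOST_LOCK_INHERITED` variable are all hypothetical names, and the `wait` argument stands in for the proposed wait-or-fail command-line flag. Atomic creation relies on `os.open` with `O_CREAT | O_EXCL`, so two tools cannot both create the file.

```python
import contextlib
import os
import tempfile
import time

LOCK_ENV = "HOST_LOCK_INHERITED"  # hypothetical env var marking an inherited lock


class LockHeld(RuntimeError):
    """Raised when another, unrelated tool already holds the lock."""


@contextlib.contextmanager
def host_lock(path, wait=False, poll=0.1, timeout=5.0):
    """Hold an exclusive per-host lock; yields True if created, False if inherited."""
    if os.environ.get(LOCK_ENV) == path:
        yield False  # a calling tool already holds this lock, so just proceed
        return
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes "check and create" a single atomic step
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            break
        except FileExistsError:
            if not wait or time.monotonic() >= deadline:
                raise LockHeld(path)
            time.sleep(poll)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        os.environ[LOCK_ENV] = path  # child processes inherit this
        yield True
    finally:
        os.environ.pop(LOCK_ENV, None)
        os.unlink(path)


def demo():
    """Exercise create, inherit, conflict, and release in a temp directory."""
    path = os.path.join(tempfile.mkdtemp(), "host.lock")
    results = []
    with host_lock(path) as created:
        results.append(created)
        with host_lock(path) as nested:  # same env, as a called tool would see it
            results.append(nested)
        inherited = os.environ.pop(LOCK_ENV)  # pretend to be an unrelated tool
        try:
            with host_lock(path):
                results.append("unexpected")
        except LockHeld:
            results.append("blocked")
        os.environ[LOCK_ENV] = inherited
    results.append(os.path.exists(path))
    return results
```

Calling `demo()` walks through the four cases: the outer tool creates the lock, a nested call inherits it, an unrelated caller fails immediately, and the file is gone afterwards. One known weakness of this design is that a tool killed without cleanup leaves a stale lockfile behind; an advisory lock such as `fcntl.flock` would be released by the kernel in that case, at the cost of being Unix-only.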
[18:19:13] might not be a good idea
[18:19:23] s/good/bad/
[18:19:34] I guess ETOOTIRED
[18:19:43] I think both statements are probably true :)
[18:20:32] lol, in general I think it would help, I'm trying to think of corner cases in which it could make things worse or more complex
[18:21:16] like the weekly restart, if something else is going on you'd like to wait and do it later, so a sleep might be appropriate for like 1h or even more, given the weekly scale
[18:21:28] hmm yeah
[18:21:32] but if in that "thing" that you were doing you already restarted it, then the second restart is useless
[18:21:35] I can see where this is going
[18:21:44] or worse, the script itself has changed on disk
[18:21:59] we clearly need an Enterprise Fleet Task Scheduler and Executor with redundant job queue hosts and...
[18:22:05] and you ran the old one that is not anymore compatible, etc...
[18:22:11] eheheheh
[18:22:55] or alternatively, we eliminate running such scripts manually
[18:23:23] then the exclusivity is at least limited to cronjobs (incl agent runs) and cumin, which is a smaller problem domain to solve
[18:23:41] I'm sure systemd™ can solve *all* those issues and get rid of crontabs
[18:23:47] and everyone agrees that if you're on a manual shell, you should be readonly or you're probably risking a breaking race
[18:24:18] (if you want to take actions, use single-host cumin to execute for you)
[18:24:37] that brings up another past crazy idea
[18:24:56] there should be some kind of read-only root shell that really works
[18:25:21] as in, a way to sudo to root that uses capabilities and/or syscall filters or whatever and really prevents any state modification of the system.
[18:25:40] you can look at files, you can trace things, etc. but you can't write, you can't edit, you can't send network traffic...
[18:26:23] because really, most of the time you're intending to be mostly readonly on a prod root shell, and it's a nice safety feature.
[18:27:11] eheh yeah, it's what we try to do lamely with ro files managed by puppet
[18:27:27] yeah but :w! is like, muscle-memory at this point
[18:28:22] I think you could do it with capabilities and syscall filters wrapping your shell invocation, pretty closely anyways, and only secure enough as a safety-feature, not a security-feature, of course.
[18:28:52] it's clearly to avoid mistakes only
[18:29:00] but would be useful
[18:29:07] but honestly it should be something baked in at a pretty low level in Linux's permissions system, the idea of having uid=0-like permissions, but only for read-like operations.
[18:29:48] I hate every time I need to become root just to cd into a directory I don't have access
[18:29:58] to do ls/grep/less
[18:30:15] just because I don't know the structure of the files in there beforehand
[18:30:24] yeah
[18:30:50] then I lose my PS1, aliases, etc..
[18:31:47] I gave up on shell customization years ago, probably to my detriment in the modern era. But I grew up in a world where it wasn't always Linux and/or bash, and it wasn't always my account, and shared root login was used a lot, etc.
[18:32:13] so I got used to the idea that shell customization was a net loss, because my muscle/eye memory was all wrong in the all-too-common case I didn't have my customized shell handy
[18:33:15] eheheh I had that too, mostly, but for some period in the past I was the one deciding the "base installation"
[18:33:26] so the common stuff was kinda made my way :D
[18:34:21] I use just a few things here
[21:52:57] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3977256 (Vgutierrez) >>! In T187035#3976273, @Dzahn wrote: > @Vgutierrez re: Icinga command permissions. should be all done. the ultimate test is if you tr...
[22:00:41] volans, have you tried "sudo -s" ?
[22:02:03] it seems to preserve PS1, working directory, etc.
[22:42:15] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3977347 (ayounsi)