[00:04:21] volans: I doubt it, probably just issues with esams cp hw and/or our puppetization in general, it failed at the puppetize->reboot part at the end somehow
[00:04:44] (but was easy to manually get it going)
[03:29:56] 10Traffic, 10ExternalGuidance, 10Operations, 10Patch-For-Review: Deliver mobile-based version for automatic translations - https://phabricator.wikimedia.org/T212197 (10santhosh) We don't have any other blockers for this puppet patch. Once this puppet patch is deployed, we have a configuration change https...
[06:49:10] 10netops, 10Operations: librenms logrotate script seems not working - https://phabricator.wikimedia.org/T224502 (10elukey) New error! ` /etc/cron.daily/logrotate: error: error setting owner of /var/log/librenms/daily.log to uid 498 and gid 0: Operation not permitted run-parts: /etc/cron.daily/logrotate exited...
[09:06:50] bblack: so I've checked the logs, and AFAICT you reimaged cp3044 twice (why?). The first time it all went well up to after the reboot, when it was polling for the successful completion of the puppet run @reboot and got KeyboardInterrupt (did you ctrl+c it?).
[09:09:45] The second time was not really needed, as the host had basically already been successfully reimaged; that one failed to reboot the host after the first puppet run because of a mismatching key (Host key verification failed) while SSHing in to reboot it.
[09:10:29] This one actually puzzles me, given that the script does do a puppet run on the cumin host from where you're running the script exactly before the reboot, to ensure it has the right known host key. I'm digging a bit more.
[09:14:47] and effectively that puppet run was a noop: https://puppetboard.wikimedia.org/report/cumin1001.eqiad.wmnet/072d9b0d5fb550a00a292b80ec0ba96da3ff808c
[09:15:10] while the run that applied the new key for cp3044 was the next one at 20:05: https://puppetboard.wikimedia.org/report/cumin1001.eqiad.wmnet/42f3ff60ec49b4885984a334a0900079ea26e710
[09:33:12] from a first look the timeline makes sense:
[09:33:41] - 19:49:23 cp3044 puppet-agent[1096]: Applied catalog in 1305.68 seconds
[09:34:13] - 19:49:24 puppetmaster1001 apache logs: PUT /puppet/v3/report/cp3044.esams.wmnet?environment=production&
[09:34:25] - 19:49:32 cumin1001 puppet-agent starts to sync the new known hosts key
[09:41:54] *but* then we have
[09:42:07] - 19:50:27 cp3044 puppet-agent[27663]: Using configured environment 'production' (another puppet run)
[09:44:31] (that does *a lot* of things, in particular it updates/installs all the facter plugins, which takes quite a while)
[09:44:50] - 19:51:01 cp3044 puppet-agent[27663]: Loading facts
[09:44:57] - 19:51:16 cp3044 puppet-agent[27663]: Caching catalog for cp3044.esams.wmnet
[09:45:56] - 19:51:17 cp3044 puppet-agent[27663]: Applying configuration version '1559159463' (puppet run starts applying stuff)
[09:46:07] - 19:51:54 cp3044 puppet-agent[27663]: Applied catalog in 37.18 seconds
[09:48:16] and ofc the questions are 1. why it had to do stuff in this second run, which means our puppetization is not able to do it all in one run (and we should fix it) and 2. what started that run, given that the crontab is set for 28,58 (at least now)
[09:54:54] and ofc the host's ssh_host_ecdsa_key.pub was created during d-i and not modified (19:20:50)
[09:55:16] --- EOF ---
[12:32:05] volans: the reason for two reimages is I failed to have the intended new puppetization merged when I kicked off the first one (so it just reimaged to its old role setup)
[12:32:42] volans: the second reimage was to the correct puppetization, and after the reimage script dropped out, I was able to ssh in manually and fix it (go through agent runs and reboots, etc)
[12:33:14] so your confusing agent run is probably my first manual one
[12:33:28] and yeah, our puppetization is way off of being able to execute in a single run
[12:34:34] but it's non-trivial to fix. may get easier once we've killed more legacy varnish-be code for this case. I've identified 1-2 key issues that create the most problems in getting to a single-run puppetization, but I'm waiting for e.ma to get back and discuss, since he was actively working on this stuff and may understand it better.
[12:35:33] But FTR, the most-annoying initial puppetization issues for the new ats-be upload nodes seem to be:
[12:37:02] 1) Nothing seems to be suppressing systemd activation/start of the default "varnish.service". We still need the package, and we still configure our separate custom "varnish-frontend.service". The old 2-varnish setup overrides/overwrites "varnish.service" to use as the backend instance, but the ats-be setup doesn't do so, and then the default one ends up starting and having a listen-port conflict with other things puppet's trying to install/start
[12:37:23] which is fixable by stopping varnish.service manually once or twice while getting through the initial multi-agent-run mess
[12:38:29] (probably for now, given we'll have mixed ats/varnish nodes for some time during the transition, the ats variant should actively disable varnish.service and get it done at the right time in puppet dependency terms)
[12:39:48] 2) "update-ocsp-all" -vs- nginx install-time issues: Basically when trying to install our custom nginx package, it wants to end up executing update-ocsp-all during postinst, but it doesn't have any configured certs yet and fails, which then breaks the postinst, and then that breaks other package installs, etc...
[12:40:26] (I've taken to just editing update-ocsp-all itself and putting an exit 0 at the top to get postinst to succeed, then later puppet fixes the script and everything eventually gets deployed and stapled)
[12:44:48] actually I'm not even sure precisely when the initial update-ocsp-all fails. it may not actually be for lack of certs, it may be because of a non-puppet dependency loop in how this is hooked together.
[12:45:32] update-ocsp-all has a hook to reload nginx after it's done messing with certs, and nginx postinst also wants to run it on install, so it's probably as simple as that conflict.
[12:51:37] got it, so the timeline is now clear. What's not yet clear is why cumin1001 didn't get the known hosts key updated. I have half an idea, I'll look in a bit if I can prove it
[13:13:22] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3047.esams.wmnet'] ` The log can be found i...
[13:15:07] volans: whatever it is, I'm sure manual intervention caused it, I don't know if it's worth digging too hard
[13:15:45] in general even on a good day these don't complete on their own due to puppetization issues. It's just that usually in a cleaner run, I'm able to ssh in and get puppet working before the script times out, so that it ends up reporting success.
[13:17:32] https://blog.apnic.net/2019/05/30/public-dns-in-taiwan-the-latest-victim-of-bgp-hijack/
[13:17:33] sure, but that specific failure is the first time I've seen it, and a bit weird
[13:34:03] ok I have an explanation, thanks to jbond42 for the support and for forcing me to re-look from the start
[13:34:50] in some cases we end up having known hosts in /root/.ssh/known_hosts
[13:35:09] and those are also used to verify identity, and if it mismatches, it fails
[13:36:11] are you referring to that file on: the puppetmaster, the cumin server, or the target host?
[13:36:20] the cumin servers
[13:36:23] right
[13:36:32] so again, probably caused by my manual interference
[13:36:45] re-reading the error in the cumin log of the reimage I noticed the different path
[13:36:48] (from using the new_install key from the cumin machine to jump in early)
[13:36:57] Offending key for IP in /root/.ssh/known_hosts:43
[13:37:37] we can either add the option to cumin's config to use only the puppetized known hosts path
[13:37:50] which I'm about to do again, because otherwise I can't tell if the first puppet run is ever going to install my user account and ssh key for normal access before it fails or not
[13:37:52] or globally to not write there/not read from there
[13:39:00] the worst is there's a period during the first agent run, usually, where new_install no longer works, but my account doesn't work for ssh either. Sometimes it can fail in that window too :/
[13:39:28] I know there was a task IIRC to make sure the removal of the install_console one is at the end of the first run
[13:39:31] (yay for non-deterministic ordering of supposedly-independent aspects of the configuration)
[13:39:49] yay
[13:40:54] in general we shouldn't be sshing around from the root account, and thus /root/.ssh/known_hosts really shouldn't have any manual adds
[13:41:17] it's kind of a strange corner case to hop on e.g. cumin1001 and do a manual root->root ssh using the new_install key
[13:41:43] but install_console has UserKnownHostsFile=/dev/null
[13:41:45] already
[13:42:09] I didn't even know about install_console until you just mentioned it
[13:42:37] how were you ssh-ing? doing what install_console does, just manually? :D
[13:42:43] yes
[13:42:49] lol
[13:42:59] let me first delete that file
[13:43:02] you have to remember, all my primary wmf-infra muscle memory is from like, back before any of this automation for reimages even existed
[13:43:20] install_console is before my time, so I assumed it was old
[13:43:52] anyways, I'd suggest adding "-4" to the install_console ssh args. Sometimes when you need that ssh, ipv6 isn't yet working right or doesn't match DNS, etc, whereas ipv4 is statically-configured by the installer and always works.
[13:44:35] oh I assumed you made install_console. For all I know it predates me and I just never knew about it.
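(Editor's note: to make the ssh options discussed above concrete, here is a rough Python sketch of what an install_console-style wrapper effectively does: ignore /root/.ssh/known_hosts entirely and, per bblack's suggestion, force IPv4. The actual install_console is a shell script in operations/puppet; the key path below and the StrictHostKeyChecking setting are placeholders/assumptions, not the real configuration.)

```python
#!/usr/bin/env python3
"""Rough sketch of an install_console-style ssh wrapper (illustration only).

The real install_console is a shell script in operations/puppet; the key
path below is a placeholder, not the actual new_install key location, and
StrictHostKeyChecking=no is an assumption about how the prompt is avoided.
"""
import subprocess
import sys

NEW_INSTALL_KEY = "/path/to/new_install_key"  # placeholder path

def install_console(host: str) -> int:
    cmd = [
        "ssh",
        "-4",                                  # force IPv4: statically configured by d-i, so it always works
        "-i", NEW_INSTALL_KEY,                 # the temporary installer key
        "-o", "UserKnownHostsFile=/dev/null",  # never record/verify against /root/.ssh/known_hosts
        "-o", "StrictHostKeyChecking=no",      # freshly installed hosts have brand-new host keys
        f"root@{host}",
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(install_console(sys.argv[1]))
```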
[13:44:43] !log rm /root/.ssh/known_hosts on cumin[12]001
[13:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:34] oh it's mid-2016
[13:45:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/274101
[13:45:44] the vintage of install_console, I mean
[13:46:13] it says puppetize, I bet it was there before :)
[13:47:37] I guess it could share the same config I use for cumin's installer:
[13:47:38] /etc/cumin/config-installer
[13:47:48] with the addition of -4, I agree with you
[13:53:16] anyways, my current one made it through the first puppet run to a reboot (usually that's the case!)
[13:53:25] so probably I can ssh in the normal way to fix up the end bits
[14:03:37] bblack: at some point could you look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/512495 even if it's just to tell me why this is the wrong approach?
[14:07:50] jbond42: well there's long and short term views of this whole debacle. In the short term though, I think these kinds of things can help. We could maybe debate/bikeshed a couple possible approaches....
[14:09:19] i think bikeshedding can continue in the task. this was my first attempt at writing some vlc so I was just wanting a high level idea of whether this is a sane thing to do and if the implementation looks good. not looking to advocate this over something else
[14:09:32] s/vlc/vcl/
[14:09:52] 1) Usually with these defensive moves, frontend cache hits are cheap; it's the backend caches and the application layer behind them that we want to protect from egregious abuses. So there's an existing abstraction called "sub cluster_fe_ratelimit"
[14:10:16] in that same file, which applies a ratelimit only to requests which aren't frontend cache hits.
[14:10:46] this reduces the traffic through the ratelimiter machine in the first place, and targets what we really care about more precisely
[14:11:39] jbond42: one problem with a static list approach is that the list of common sizes varies per wiki (and may change over time)
[14:11:42] 2) Right now, all we have in there is some specific stuff about maps and about facebook. Whereas in cache_text's equivalent case, we did decide to put some upper-sanity-bound on all requests from anyone (per-client-IP)
[14:12:20] jbond42 for an example: https://phabricator.wikimedia.org/T215106
[14:13:01] (I should have said: on all non-frontend-hit requests)
[14:14:01] 3) cases like the one this patch is in response to (a few IPs spamming reqs for an uncommon size)... the problem comes in the cache misses, and there's no point unfairly targeting e.g. a common/popular set of actual cached 224px images that happened to exist in some wiki somewhere, etc...
[14:14:21] 10Traffic, 10Operations, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3047.esams.wmnet'] ` and were **ALL** successful.
[14:14:44] so I'd tend to think: avoid digging into the "common size" part of the problem, and go for establishing a general upper-bound ratelimit on frontend cache misses per second per client IP like text does, which also would've covered this case.
[14:15:07] (and also exclude our own infra and such, like text does. maybe don't exclude wmcs in this case, though)
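(Editor's note: a conceptual sketch of the ratelimit shape bblack describes above. The production mechanism is the cluster_fe_ratelimit VCL abstraction in the cache frontend VCL; this Python version only illustrates the logic that frontend cache hits are never counted and only misses/passes consume a per-client-IP budget. The limit and window values are invented for the example.)

```python
"""Conceptual sketch of a per-client-IP ratelimit on frontend cache misses.

Not the real VCL: the production implementation lives in cluster_fe_ratelimit.
LIMIT and WINDOW below are invented numbers, not production settings.
"""
import time
from collections import defaultdict, deque

LIMIT = 100       # invented: max non-hit requests per client IP ...
WINDOW = 10.0     # invented: ... per 10-second sliding window

_events = defaultdict(deque)   # client IP -> timestamps of recent non-hit requests

def allow_request(client_ip, is_frontend_hit):
    """Return False if this request should be rejected (e.g. with a 429)."""
    if is_frontend_hit:
        return True   # hits are cheap: never ratelimit them
    now = time.monotonic()
    q = _events[client_ip]
    while q and now - q[0] > WINDOW:   # expire timestamps outside the window
        q.popleft()
    if len(q) >= LIMIT:
        return False
    q.append(now)
    return True
```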
[14:15:21] 3 is basically my comment
[14:15:44] we are usually quite happy to change default thumbs for small wikis
[14:16:33] I think, as with many things, "app should do better" but probably it doesn't
[14:16:40] jbond42: so that's my thinking on the short term end of this problem. You can kinda meld what you're doing with examples from the cache_text equivalent to get closer, and we can maybe take a peek at hadoop data to find reasonable upper bounds to limit at, etc.
[14:17:03] thanks both, this is useful. i have not really played with varnish before, and from the bit of research i did over the weekend and looking at our config it seems like we are really using a lot of the knobs
[14:17:17] in the long view, it's generally problematic that we offer an unbounded image resizing service to whatever size for whatever random linking site for everyone, when resizing is known-expensive.
[14:17:19] yeah, if the rate is much higher than a reasonable one
[14:17:27] for legitimate usages
[14:17:42] bblack: tbh i was really just looking for some pointers like this. i'll keep an eye on the task and watch how that progresses and possibly have another crack later
[14:17:45] it will work, I think, vs malicious attempts
[14:17:48] for now i'll abandon this change
[14:18:23] jbond42: I think you are on the right track
[14:18:26] we should probably eventually develop a solution to this in policy terms deeper in the stack (e.g. limit the possible size values we accept at all and readjust requests to the next-bigger of the supported global sizes, or something, and get the MW editing process on board with typically using them that way, etc)
[14:18:42] I was just suggesting, like bblack, to look at statistics
[14:19:12] jbond42: if you have time, feel free to keep amending in new directions or whatever. We should do *something*
[14:19:13] thanks jynus, don't worry you have not put me off :)
[14:19:13] bblack: fixing the application is always the right solution!
[14:19:36] but not necessarily the one we can do :-D
[14:20:01] ok in that case i can try and refresh this to move to something a bit closer to what has been discussed, it can be there as an option to reject at a later date if there is something better :)
[14:20:03] that is why I have rate limits on the db, and there are no good ones on the app
[14:21:44] jbond42: I think we could/should deploy the short-term fix now really, but more in line with the above
[14:22:15] ok i'll try to have another crack today or tomorrow then :)
[15:01:36] bblack, vgutierrez
[15:03:07] vg is out
[15:03:11] just you and me again heh
[15:03:22] I got distracted by IRC :)
[15:33:54] bblack: fyi i did a refresh of that change if you get a sec, although i'll probably take a break in a bit so may be afk
[15:38:23] ok thanks!
[16:24:34] bblack: me and alex are doing a traffic section for tomorrow's talk
[16:24:42] which means for next week's talk
[16:25:09] what are the top 5-6 things you think we should discuss
[16:25:20] apart from eg geodns and local caching pops
[16:25:29] and gdnsd
[16:25:29] what's tomorrow's talk?
[16:25:32] jijiki: I'm lost on context
[16:25:38] what talk when/where to whom?
[16:25:41] ah sorry, there was an invite
[16:25:44] for tomorrow
[16:25:47] ok ok ok
[16:25:51] context!
[16:26:07] https://whova.com/embedded/session/voxxe_201906/589826/
[16:27:08] voxxed athens
[16:27:14] the structure is part I: application layer, databases, storage
[16:27:18] so from in to out
[16:27:22] and part II
[16:27:33] starts with traffic, so from out -> in
[16:27:39] (if I am making any sense)
[16:27:41] yes
[16:28:34] so for the traffic part with alex we will mention
[16:28:35] feel free to re-use my diagrams - https://wikitech.wikimedia.org/wiki/Network_design :)
[16:28:40] I have some philosophical musings brewing about how people generally describe and talk about layers of software/infra and how it shapes misperceptions, but not necessarily relevant feedback to you right now :)
[16:29:31] XioNoX: I found it and I thought about it, but then I thought of drawing something simpler because the audience will be full of devs
[16:29:58] bblack: I will send you the slides tomorrow :d
[16:30:09] for the time being we have
[16:30:25] from the outside-in, and skipping over tons of details you know or can ask about, the general picture I'd expect is something like: DNS lookups (gdnsd / geodns stuff, routing to the approximate nearest edge, edge-site depool control here and the 10 minute TTL, depending on level of detail)
[16:30:56] and then hitting the front edge caches, and how that's chashed on client IP and they're all storing the same redundant/hot hits in memory and get the lion's share of our hits.
[16:31:13] and then chashing by-URL to the next layer of larger un-redundant disk storage caches
[16:31:21] and then hitting the applayer
[16:31:22] we will also mention LVS ofc
[16:31:26] well, and I skipped over LVS yeah
[16:31:31] hahaha
[16:31:39] LVS happens twice: routers->edge-caches, and then again at backend-caches->applayer
[16:31:54] we are skipping the backend-caches->applayer
[16:31:59] for simplicity
[16:32:17] anyways, I'm sure you two have a reasonable picture of things, but yes if you send me some slide preview I can just focus on whether there's any critical mistake/omission.
[16:33:34] (or I can see them during the presentation tomorrow)
[16:33:48] sure
[16:34:00] it will be a little rough ofc
[16:34:47] don't let the perfect be the enemy of the good :)
[16:34:54] lol
[16:34:57] tx bblack
[16:34:58] I think that's one of my most-common cardinal sins
[16:36:19] another thing, do we have a couple of, let's say, fun examples we could tell
[16:36:37] of issues we had due to too much traffic
[16:36:45] ok fun is not a good word here
[16:36:59] I would go with 'understandable examples'
[16:37:49] or XioNoX :D
[16:40:23] I'd need an example example to think of an example
[16:41:15] ok for instance how facebook servers cause us issues sometimes
[16:43:18] or maybe small things we have tweaked which, if this was a small site, we would never bother with
[16:46:36] jijiki: the ping offload servers maybe? Like we get so many pings that it triggers linux's icmp rate limiters and (could?) cause valuable icmp (like pmtud) to be dropped
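(Editor's note: the "linux icmp rate limiters" mentioned just above are the kernel's net.ipv4.icmp_ratelimit / net.ipv4.icmp_ratemask sysctls. A small sketch that reads and decodes them; semantics are per my reading of the kernel's ip-sysctl documentation, so treat it as illustrative rather than authoritative.)

```python
"""Decode the kernel ICMP rate-limit sysctls mentioned above (illustrative).

net.ipv4.icmp_ratelimit is the minimum interval (ms) between ICMP packets of
the rate-limited types; net.ipv4.icmp_ratemask is a bitmask of which ICMP
types are subject to that limit. Double-check against your kernel version.
"""
from pathlib import Path

ICMP_TYPES = {
    0: "echo reply", 3: "destination unreachable", 4: "source quench",
    5: "redirect", 8: "echo request", 11: "time exceeded",
    12: "parameter problem", 13: "timestamp request", 14: "timestamp reply",
}

def read_sysctl(name: str) -> int:
    return int(Path("/proc/sys/" + name.replace(".", "/")).read_text())

def rate_limited_types() -> list:
    mask = read_sysctl("net.ipv4.icmp_ratemask")
    return [label for bit, label in ICMP_TYPES.items() if mask & (1 << bit)]

if __name__ == "__main__":
    print("interval between limited ICMP packets (ms):",
          read_sysctl("net.ipv4.icmp_ratelimit"))
    print("rate-limited ICMP types:", ", ".join(rate_limited_types()))
```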
[16:46:57] ah that is a nice funfact
[16:48:29] 10netops, 10Operations, 10Operations-Software-Development, 10netbox, and 2 others: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10ayounsi) Opened T224679
[17:02:23] XioNoX: if you remember another, please ping :D
[17:02:38] we should have a funfacts wiki page
[17:02:56] we could add one on wikitech
[17:03:07] if I gather a few, I am definitely up for it
[17:03:12] and starting it
[17:03:14] maybe skimming through incident reports
[17:04:25] nah too much work, I'll have a quick look tomorrow
[17:05:18] one of our most interesting/relatable things we've used in past such presentations has been celebrity death spikes
[17:05:51] celebrity dies -> everyone links to wikipedia to read about them, while simultaneously a bunch of editors are spiking the edit rate, constantly updating it as details about the cause of death emerge and nitpicking each other's changes
[17:06:07] lol right, that is a nice one as well
[17:06:15] it can become a huge load problem as heavy read traffic focuses on an article that's constantly being invalidated from caches, etc
[17:07:08] we have a section where we describe a read and a write
[17:07:37] this could fit in there actually, we have already explained some things about caching and the infra
[17:08:34] * jijiki writes this down as well
[17:11:35] I also have in the list that after the US elections, traffic from FB has increased
[17:11:45] that stands yes?
[17:18:37] I don't remember any details about that, but seems plausible
[18:09:22] bblack: cp1008 has been alerting for a week - https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp1008&service=Freshness+of+zerofetch+successful+run+file
[18:09:22] But its notifications are disabled and I can't find any note/task on why. Any idea?
[18:10:34] no idea, but it can't be critical. in general cp1008 is usually perma-downtimed, perhaps that expired
[18:11:02] will re-downtime the whole thing
[18:11:20] also I can't figure out what "Freshness of zerofetch successful run file" means
[18:12:33] There's a thing called the zerofetcher, and it touches some file to record the fact that it did its job successfully, and the icinga check looks at the mtime of that file to know that zerofetch hasn't been failing at its job recently (or failing to run at all, or whatever)
[18:12:52] basically
[18:13:23] either way, zerofetcher is mostly dead weight now, as Wikipedia Zero is already past its EOL date, we just haven't yet devoted time to deconstructing and cleaning up all the related technical infra.
[18:13:48] ok!
[18:14:07] and then cp1008 is a pure testbed host, nothing on it should ever be an actionable alert (perma-downtime!). and it's also due to be replaced by a slightly better version, I think using cp1099.
[18:15:10] this has been your daily dose of "hey check out these tech debt things we haven't gotten around to yet"
[18:15:36] hehe, yeah while looking for any cp1008 tasks I found the cp1099 one
[18:42:11] 10Traffic, 10Operations, 10ops-codfw: lvs2002: raid battery failure - https://phabricator.wikimedia.org/T213417 (10ayounsi) a:03Papaul Assigning that to Papaul as I think next steps are for DCops, but please reassign if incorrect.
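(Editor's note: the "freshness" pattern bblack describes at 18:12:33, where a job touches a marker file on success and the monitoring check compares the file's mtime against a maximum age, can be sketched roughly as below. The path and threshold are placeholders, not the real zerofetch check's values or plugin.)

```python
"""Rough sketch of a 'freshness of successful run file' check (illustration only).

MARKER and MAX_AGE are placeholders; the real zerofetch check in puppet/icinga
uses its own values and plugin.
"""
import os
import sys
import time

MARKER = "/var/run/zerofetch/last-success"   # placeholder path
MAX_AGE = 24 * 3600                          # placeholder: one day

def check_freshness(path: str, max_age: int) -> int:
    """Return a Nagios/Icinga-style exit code: 0=OK, 2=CRITICAL, 3=UNKNOWN."""
    try:
        age = time.time() - os.stat(path).st_mtime
    except FileNotFoundError:
        print(f"UNKNOWN: {path} does not exist")
        return 3
    if age > max_age:
        print(f"CRITICAL: {path} is {age:.0f}s old (max {max_age}s)")
        return 2
    print(f"OK: {path} is {age:.0f}s old")
    return 0

if __name__ == "__main__":
    sys.exit(check_freshness(MARKER, MAX_AGE))
```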
[18:45:28] bblack: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp3041&service=Varnish+frontend+child+restarted
[19:12:29] 10Traffic, 10Operations: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 (10ayounsi) p:05Triage→03Normal
[19:31:18] 10Traffic, 10Operations, 10serviceops, 10User-jijiki: Allow directing a percentage of API traffic to PHP7 - https://phabricator.wikimedia.org/T219129 (10Jdforrester-WMF) Given this is being used now, does this count as Resolved? Or do you want to keep this open until clean-up?
[19:43:43] 10Traffic, 10Operations: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 (10BBlack) That alert basically means that a varnish frontend daemon crashed (and as usual was auto-restarted by a manager process). These are pretty rare and usually worth some investig...
[21:30:05] https://blog.apnic.net/2019/05/30/public-dns-in-taiwan-the-latest-victim-of-bgp-hijack/
[21:40:26] I wonder if we could have a bot scrape out all the links we post in all public channels and make a daily reading log (and auto-exclude our own URLs since they're likely just debugging links and examples and such)
[21:41:02] like an automatic blog of interesting links people mentioned in various wmf tech channels :)
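(Editor's note: a toy sketch of the kind of filter such a link-logging bot might use. The assumed log format and the excluded "own" domains are illustrative guesses, not anything that exists.)

```python
"""Toy sketch of the link-collecting bot idea above (assumptions throughout).

Assumes plain-text channel logs with '[HH:MM:SS] nick: message' lines; the
OWN_DOMAINS list is an illustrative guess at what 'our own URLs' would mean.
"""
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")
OWN_DOMAINS = ("wikimedia.org", "wikipedia.org", "wmflabs.org")  # assumed exclusions

def interesting_links(log_lines):
    """Yield de-duplicated external URLs found in the day's log lines."""
    seen = set()
    for line in log_lines:
        for url in URL_RE.findall(line):
            url = url.rstrip(").,>")  # strip trailing punctuation from chat text
            host = urlparse(url).hostname or ""
            if any(host == d or host.endswith("." + d) for d in OWN_DOMAINS):
                continue  # skip our own infra/debugging links
            if url not in seen:
                seen.add(url)
                yield url

# Example usage: print today's reading list from a hypothetical log file
# for link in interesting_links(open("/var/log/irc/traffic-channel.log")):
#     print(link)
```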