[06:21:16] netops, Operations: Facebook BGP peering links down in ulsfo - https://phabricator.wikimedia.org/T239896 (ayounsi) Open→Resolved a: ayounsi
[09:48:36] netops, Operations, cloud-services-team: Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (fgiunchedi) FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS ranges with say 1.5x the current thresholds and see how f...
[10:17:28] netops, Operations, SRE-tools, Goal, and 2 others: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (Volans)
[11:12:34] I'm merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/558138/ (vhtcpd systemd unit) with puppet disabled FYI
[11:16:54] things look sane on cp1075, the process was running as follows before:
[11:16:57] /usr/sbin/vhtcpd -m 239.128.0.112 -m 239.128.0.115 -c 127.0.0.1:3128 -c 127.0.0.1:3127 1.0 start
[11:17:03] and after running puppet it now is:
[11:17:08] /usr/sbin/vhtcpd -m 239.128.0.112 -m 239.128.0.115 -c 127.0.0.1:3128 -c 127.0.0.1:3127 1.0 startfg
[11:27:56] almost right on cp1076 too! https://gerrit.wikimedia.org/r/#/c/558474/
[11:50:35] looking good, re-enabling and running puppet after lunch!
[11:51:05] BTW the icinga warnings about puppet being disabled have been gone for a while
[11:51:35] is it intentional? I found them fairly useful :)
[12:09:04] I think they were tweaked to only fire after a long period?
[12:09:27] to avoid spam for typical day-to-day things like "disabling puppet for a rollout"
[12:09:56] "longer period"
[12:11:49] netops, Operations, cloud-services-team (Kanban): Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (aborrero) >>! In T240789#5747102, @fgiunchedi wrote: > FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS range...
[13:10:58] ah nice
[13:21:37] Traffic, Operations, Patch-For-Review: ATS skipping certain logs due to lack of buffer space - https://phabricator.wikimedia.org/T237608 (ema) Open→Resolved a: ema We have bumped buffer sizes, decreased the amount of information being logged, and added icinga checks alerting if logs are sk...
[13:24:58] Traffic, Operations: vhtcpd segfaulted and did not get restarted - https://phabricator.wikimedia.org/T240826 (ema) Open→Resolved a: ema Now that systemd properly takes care of supervising `vhtcpd.service`, the service is automatically restarted upon SIGSEGV: ` Dec 17 13:13:35 cp3050 systemd...
[13:36:53] ah, I missed the segfault as the reason for the systemdization earlier
[13:37:05] any clues on the segfault other than 'somewhere in libc'?
[13:37:20] that code hasn't generally segfaulted in the past :P
[13:41:44] although the purger queue sizes are probably some kind of clue
[13:42:34] they were constantly rising leading up to it, meaning it couldn't output to varnish or ATS anymore
[13:43:25] Dec 15 20:17:22 cp1075 vhtcpd[148095]: Purger0: input: 15970859342 failed: 1355211 q_size: 265942528 q_mem: 40675459219 q_max_size: 265942594 q_max_mem: 40675467672
[13:43:27] Dec 15 20:32:22 cp1075 vhtcpd[148095]: Purger0: input: 15973436578 failed: 1355211 q_size: 267722646 q_mem: 40879916804 q_max_size: 267722646 q_max_mem: 40879916804
[13:43:31] ^ last two entries before crash
[13:43:58] which is 38G of queued purge data
[13:44:22] perhaps we finally got a NULL back from some malloc-like and then tried to use it and libc faulted us?
[13:44:23] that q_mem number is exactly q_max_mem
[13:44:28] yeah
[13:44:34] meaning the current value is the historical max
[13:44:49] oh okay, I wasn't sure if q_max_mem was some absolute maximum or something
[13:44:49] (max isn't config, it's just tracking historical-max)
[13:45:27] I don't think vhtcpd looks for NULL allocator returns
[13:45:41] it's something I changed my opinion about since whatever I copypasta'd into there back in the day
[13:46:16] either way, we can/should perhaps just set the maximum queue size smaller
[13:46:50] oh we're not setting it?
[13:47:37] actually it doesn't seem to have such a setting
[13:47:47] I could've sworn it had one in the past, maybe my brain is faulty
[13:49:15] apparently I removed it
[13:50:09] https://gerrit.wikimedia.org/r/#/c/operations/software/varnish/vhtcpd/+/384433/ is kinda awesome, because all in the same commit message in different subparts:
[13:50:19] [...] There's an absolute memory size limit which wipes
[13:50:19] the queue regardless, so this doesn't lead to infinite growth.
[13:50:23] [...]
[13:50:41] Along the way, the argument for maximum queue memory has been
[13:50:41] removed (as well as the limiter code that checked for it and wiped
[13:50:41] the queue). In its place, a queue_mem statistic has been added
[13:50:41] that can be tracked/alerted. Obviously, if a sanity limit is
[13:50:41] desired, we can put that in the process ulimits and simply have
[13:50:43] the daemon fail malloc and restart.
[13:51:03] ---
[13:51:27] so apparently I justified half the change on there being a limit, then in the next part of the change removed it? :)
[13:52:11] and I guess some past version of me figured malloc-fail was a reasonable choice, but didn't bother to make it self-restarting at the init level
[13:54:11] anyways, so a few things to unpack into that ticket maybe, beyond the systemd conversion for restarts
[13:54:36] (a) We should put some ulimit on it and let it crash earlier (38G of backlogged purges? something is very wrong in the world)
[13:55:09] (b) What caused the 38G of backlogged purges? (meaning from vhtcpd's point of view, it couldn't deliver them via HTTP PURGE to the daemon anymore)
[13:55:45] (c) Even if we do ulimit to crash it earlier in these cases, we should still have some kind of alerting so this doesn't go unnoticed due to a systemic problem
[14:04:23] Traffic, Operations: Write side of ats-tls named pipe deleted upon logging config change reload - https://phabricator.wikimedia.org/T240950 (ema)
[14:04:35] Traffic, Operations: Write side of ats-tls named pipe deleted upon logging config change reload - https://phabricator.wikimedia.org/T240950 (ema) p: Triage→Normal
[14:11:23] bblack: it would be easy enough to have something export q_size and such to prometheus (even if it was just mtail)
[14:13:03] Traffic, Operations: Write side of ats-tls named pipe deleted upon logging config change reload - https://phabricator.wikimedia.org/T240950 (ema) > I suspect that reloading ats would break logging at this point. That's not the case, reload works perfectly fine. Changes to the log format are reflected c...
[14:20:12] cdanis: well at this point the crashes are more important. Do we have a universal way to alert on daemon crashes? :)
[14:20:32] making it a systemd unit does that!
[14:23:45] it does?
[14:24:27] I think only if we avoid autorestart, right?
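A minimal sketch of what the "systemd conversion for restarts" plus point (a) above could look like as a unit drop-in; the file path, the LimitAS choice, and the 4G value are illustrative assumptions, not the actual puppetized unit:

    # /etc/systemd/system/vhtcpd.service.d/limits.conf (hypothetical drop-in)
    [Service]
    # restart the daemon automatically if it dies, e.g. on SIGSEGV
    Restart=on-failure
    RestartSec=5
    # cap the address space so a runaway purge queue fails malloc()
    # long before it reaches tens of GB of backlog
    LimitAS=4G

With a cap like that, a stuck purger would crash and be restarted early instead of silently accumulating ~38G of queued purges, which is roughly the "fail malloc and restart" behaviour the old commit message assumed.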
[14:24:33] then systemctl --failed would alert
[14:25:05] hm, yes, I think you're right ema
[14:25:35] however I've recently added an icinga check that alerts if there's a given pattern in journalctl
[14:26:30] under the glorious name of check_journal_pattern
[14:31:33] you could also check for `systemctl show` outputting an ExecMainExitTimestamp field that is within N minutes of the present time, that wouldn't be a hard NRPE check to write
[14:46:57] it seems like a really basic thing that you'd think we could get less heuristically
[14:47:11] "if a process on a host crashes, that should raise an alert somewhere", just in general
[14:48:37] it's hard to tell what a crash is though, in general :)
[14:48:45] segfaults are easy to identify
[14:49:12] sure
[14:49:25] well I guess it depends on what layer we're looking at
[14:49:37] segfaults are really just SIGSEGV; there are other causes of death of varying normality
[14:49:37] I mean, the systemctl thing I just suggested isn't really a heuristic; it will be true whenever the unit exited or failed for whatever reason
[14:50:29] I guess this is the whole mashup of events vs metrics to me
[14:50:37] (which always bugs me, but maybe it's just me)
[14:50:53] and how we use metrics to imply events, because that's our pipeline
[14:51:28] (sometimes, an event really is defined by a metric. An event like "the request rate spiked above 5000/s")
[14:52:17] icinga checks state, prometheus tracks metrics
[14:52:32] but there's a third categorical thing here which is just event triggers, like syslog
[14:52:54] in a server environment like this, really any segfault anywhere should fire an event alert somewhere
[14:53:43] (not because of some systemd metric about a daemon's last exit time, but because a segfault happened and was sent to syslog)
[14:55:28] also the daemon exit time does not work for stuff using the supervisor <-> worker model like varnish and trafficserver
[14:55:55] the supervisor never dies
[14:55:55] we already have some log centralization
[14:57:15] we just don't have the right rules firing some alerts on that data
[14:57:33] how about:
[14:57:34] check_journal_pattern '1 hour ago' 'segfault at'
[14:57:35] (and I guess you'd need some notion of acking them or whatever, marking them handled)
[14:58:51] root@centrallog1001:~# grep segfault /srv/syslog/syslog.log|head
[14:58:51] Dec 17 01:04:36 mw1286 kernel: [16875975.186707] php-fpm7.2[28092]: segfault at 4600001417 ip 00007fdcf54e83a9 sp 00007ffd2a69a9c0 error 4 in libmemcached.so.11.0.0[7fdcf54d7000+30000]
[14:58:56] [...]
[14:59:22] I'm about to jump into a meeting, but mtail on centrallog1001 does indeed track segfaults
[14:59:40] where does that end up?
[14:59:49] in prometheus
[15:00:21] :)
[15:00:27] adding alerts for all programs isn't feasible ATM with icinga, although for a few selected programs, why not
[15:00:31] so events -> metrics -> alerts? :)
[15:01:06] bblack: why not use check_journal_pattern?
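Roughly what the NRPE check suggested at 14:31:33 could look like; the script name and arguments are made up, and it assumes GNU date plus a systemd new enough for `systemctl show --value`:

    #!/bin/bash
    # check_recent_exit (hypothetical NRPE plugin): CRITICAL if the unit's
    # main process exited within the last N minutes, regardless of whether
    # systemd has since restarted it.
    unit="$1"
    minutes="${2:-60}"
    ts=$(systemctl show "$unit" --property=ExecMainExitTimestamp --value)
    if [ -z "$ts" ] || [ "$ts" = "n/a" ]; then
        echo "OK: $unit main process has not exited"
        exit 0
    fi
    exit_epoch=$(date -d "$ts" +%s)
    now=$(date +%s)
    if [ $(( now - exit_epoch )) -lt $(( minutes * 60 )) ]; then
        echo "CRITICAL: $unit main process exited at $ts"
        exit 2
    fi
    echo "OK: last exit of $unit was at $ts"
    exit 0

As noted above, this only covers daemons whose main process actually dies; supervisor/worker daemons like varnish and trafficserver would need the syslog/metrics route instead.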
[15:01:23] we don't have a good way to do event-alerting right now, other than artificially through state caused by events (traditional icinga checks) or artificial metrics checks (segfault metric > 0)
[15:01:54] I think that's what I'm trying to drive at, but I don't have answers
[15:02:35] check_journal_pattern doesn't scale well to universal event rules like segfault
[15:02:35] bblack: I have to go now, although I would be happy to chat more about it, also wrt quarterly planning
[15:03:16] I'm trying to imagine a way to use the prometheus data from the syslog mtail to do this
[15:03:42] I could see an alert firing if the segfault counter globally in the past N hours is > 0 or something? Maybe I don't understand the capabilities well.
[15:04:51] and aggregating that by cluster or something before turning it into alerts
[15:05:25] so instead of getting spammed badly, you just get an alert that the segfault counter for cluster cache_upload was >0 at some point in the past 24h or something
[15:05:42] well that would be useful in general
[15:05:46] but then there's still the weird time-domain thing. it has to cover a week maybe
[15:06:10] but it's entirely possible for it to just "go away" on its own if nobody pays attention to the alert and eventually the condition ceases due to the time window
[15:11:06] anyways, one might argue the specifics about which events are important enough, aggregated in what way
[15:11:47] but I think my point is that an event-driven pipeline of "event X generates alert Y, which must be acked/cleared when someone sees and handles it" doesn't exist in solid form.
[15:12:06] (event as in a syslog line, as opposed to state-checks and metric values)
[15:22:19] XioNoX: I keep forgetting to ask: is it true that ECMP is L3 everywhere now? I wasn't sure without confirming with you (e.g. if I add dual static destinations to something like ns2 in esams, will it hash them on L3 (as opposed to random unhashed, or hashed on L4 ports))
[15:23:20] I vaguely recall some past convo that different router hardware and/or firmware had different capabilities and/or we may have only configured it in some places
[15:23:34] https://grafana.wikimedia.org/explore?orgId=1&left=%5B%22now-30d%22,%22now%22,%22eqiad%20prometheus%2Fops%22,%7B%22expr%22:%22segfault%7Bbinary%3D%5C%22vhtcpd%5C%22%7D%22,%22context%22:%22explore%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D
[15:24:14] bblack: L3 everywhere: https://github.com/wikimedia/operations-homer-public/blob/4dbf7010d1d4b4e9e5354836f779e9791384a45d/templates/cr.conf#L28
[15:28:45] cdanis: fascinating, especially if you remove the binary match and look at longer patterns!
[15:28:49] (also scary)
[15:28:58] the php-fpm segfaults sure are something
[15:29:13] lots of others too though
[15:29:29] what's with all the lilypond ones?
[15:29:47] I don't even know what that is
[15:30:02] ... musical rendering?
[15:30:11] yeah, that's what google says anyways
[15:30:59] we apparently segv apache2 more often than I thought
[17:00:46] bblack: FYI I said I could email Dell about the network 10G NICs, but they already know we want everything delivered before the end of December, and before the end of this week if possible
[17:00:56] so my email would be repetitive to what they already know
[17:01:14] I do realize the backstory to why we are ordering these, and having them in before the break is preferred
[17:01:28] (so I don't mind the repetitive email, will send it now that I've thought about why we need these ;)
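For the per-cluster segfault alerting idea discussed around 15:03-15:05, a rough sketch of a Prometheus alerting rule; the `segfault` metric name and `binary` label are taken from the Grafana link above, but the `cluster` label, window, and severity are assumptions:

    groups:
      - name: segfaults.rules
        rules:
          - alert: SegfaultsDetected
            # fire if any binary in a cluster segfaulted during the last 24h,
            # based on the counter mtail derives from centralized syslog
            expr: sum by (cluster, binary) (increase(segfault[24h])) > 0
            labels:
              severity: warning
            annotations:
              summary: "{{ $labels.binary }} segfaulted in cluster {{ $labels.cluster }} in the last 24h"

This still has the time-window problem raised at 15:06:10: once the 24h window rolls past the event, the alert resolves on its own whether or not anyone looked at it.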