[01:03:32] Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4037248 (ayounsi)
[01:03:35] Traffic, Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#4037246 (ayounsi) Open>Resolved Devices added to Rancid & monitoring We're all done here.
[01:04:14] Traffic, Operations: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4037252 (ayounsi)
[01:04:16] Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#2962044 (ayounsi) Open>Resolved a:ayounsi Transit, Transport, and Peering are up.
[01:52:47] Traffic, Operations, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4037302 (Krinkle)
[03:36:13] Traffic, Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): How to purge misc-web varnishes for wikitech changes? - https://phabricator.wikimedia.org/T189168#4037359 (Andrew) Open>Resolved
[04:45:56] Domains, Traffic, Operations, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (Prtksxna) Requested {T189279} too.
[04:53:17] Domains, Traffic, Operations, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (Gryllida) Jumping out of context here, but it could be nice to have the new site multi-lingual unlike what https://wiki...
[04:56:52] Domains, Traffic, Operations, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4037419 (Volker_E) @Gryllida That is one of our own quests and is discussed in T164449.
Please don't side-rail tasks, but rather...
[08:05:07] Traffic, Operations, Pybal, Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4037541 (Vgutierrez) Checking cr2-eqiad BGP neighbor information, I realized that for lvs1006 it's showing an Open Message Error tha...
[09:35:01] Traffic, Operations, Pybal, Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4037641 (Vgutierrez) rechecking logs on lvs1006.wikimedia.org shows the following output regarding bgp for Feb 22nd: ``` vgutierrez@...
[09:35:08] oh god...
[09:35:14] I feel like an idiot :/
[09:36:46] but I think that we can close that now
[09:47:41] so what we're saying is that `journalctl -u pybal` and pybal.log differ, basically
[09:47:55] yup
[09:47:56] which is like /o\
[09:48:11] a lot
[09:48:13] ffs
[09:51:39] ema: so.. shall we close T188085?
[09:51:40] T188085: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085
[09:52:45] close how?
[09:52:49] how is it resolved?
[09:53:08] there wasn't an issue to resolve
[09:53:19] at least not on pybal BGP implementation
[09:53:29] check https://phabricator.wikimedia.org/T188085#4037641
[09:54:03] aha
[09:54:06] we were just missing log lines
[09:54:15] ok hehe
[09:54:22] right :_(
[09:54:51] yay systemd?
[09:58:31] i guess we can close it as invalid ;p
[10:01:23] yay buffering, more likely
[10:01:26] https://serverfault.com/questions/832691/view-unbuffered-log-output-from-journalctl
[10:17:23] I don't get it, buffering makes some messages go away?
[10:18:02] Feb 22 09:06:30 lvs1006 pybal[26025]: [bgp] INFO: State is now: OPENSENT
[10:18:05] Feb 22 09:13:12 lvs1006 pybal[26025]: [bgp] INFO: State is now: IDLE
[10:18:11] from 09:06:30 there are several ones missing
[10:18:47] oh, some are *missing* altogether
[10:18:50] sorry I didn't get that
[10:19:30] how does all this relate to yesterday's "slow" vs "fast" pybal restarts?
[10:19:50] no relation at all
[10:20:42] pybal doesn't wait to establish the BGP session once it (pybal) is up
[10:20:52] so on fast restarts in can stress some BGP peers
[10:20:58] s/in/it/g
[10:22:03] I thought you said that quickly stopping/starting pybal was related to T188085, but I might very well misremember
[10:22:04] T188085: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085
[10:22:18] yey, I thought that yesterday
[10:23:05] and reading more pybal code I realised if that was the case, the connection should be closed (and logged as closed)
[10:23:12] ok
[10:23:23] meanwhile I've tried diffing today's pybal.log with journalctl -u pybal --since=today on lvs1006
[10:23:44] so I went to lvs1006 to read more pybal logs.. and I found this
[10:23:48] ema: and...?
[10:23:48] no lines missing, but a small minority were logged 1 sec later
[10:25:26] the datetime format is a bit different, pybal.log has 'Mar 9 00:00:09 [...]' while journalctl has 'Mar 09 00:00:09 [...]'
[10:26:37] out of 7798 log entries, 17 were logged 1 second later on the journal compared to pybal.log
[10:27:11] how much is logged for pybal? maybe the rate limits of journald kick in? see journald.conf(5)
[10:27:19] RateLimitIntervalSec and RateLimitBurst
[10:28:10] Defaults to 1000 messages in 30s
[10:28:35] are we reaching that?
[10:28:37] vgutierrez@lvs1006:~$ fgrep "09:06:30" pybal.log.15 |wc -l
[10:28:37] 397
[10:28:54] 400 messages in one second...
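Note: the `fgrep ... | wc -l` check above counts lines for a single second. A minimal Python sketch of the same idea across a whole file follows. It assumes syslog-style lines starting with `Mon DD HH:MM:SS` as in the excerpts above, and it uses fixed 30-second buckets, which only approximates how journald actually tracks its interval. The function names and the `interval`/`burst` defaults (taken from the "1000 messages in 30s" quoted in the chat) are illustrative, not part of pybal or systemd.

```python
from collections import Counter


def lines_per_second(lines):
    """Count log lines per (month, day, HH:MM:SS) timestamp.

    The day is converted to an int so that 'Mar 9' and 'Mar 09'
    (the two formats seen in pybal.log and journalctl) compare equal.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        month, day, clock = fields[:3]
        parts = clock.split(":")
        if not day.isdigit() or len(parts) != 3 or not all(p.isdigit() for p in parts):
            continue  # not a syslog-style timestamp, skip the line
        counts[(month, int(day), clock)] += 1
    return counts


def windows_over_burst(counts, interval=30, burst=1000):
    """Return fixed windows of `interval` seconds holding more than `burst` lines.

    Assumption: fixed buckets since midnight are a rough stand-in for
    journald's own rate-limit accounting, not a reimplementation of it.
    """
    windows = Counter()
    for (month, day, clock), n in counts.items():
        hours, minutes, seconds = (int(p) for p in clock.split(":"))
        bucket = (hours * 3600 + minutes * 60 + seconds) // interval
        windows[(month, day, bucket)] += n
    return {key: n for key, n in windows.items() if n > burst}
```

Usage would be along the lines of `windows_over_burst(lines_per_second(open("pybal.log")))`; an empty result only means no fixed bucket exceeded the assumed burst, not that journald dropped nothing.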
[10:29:24] so it's pybal's fault for not shipping a sane journald config :)
[10:30:16] mai una gioia, as we say in .it
[10:30:24] xDDDDDDD
[10:31:03] I laugh.. otherwise I'd kill myself or something
[10:31:16] I've been chasing a ghost bug for 2 weeks
[10:31:55] yeah this was just part of the onboarding process
[10:32:00] we knew
[10:32:08] * mark puts away the popcorn
[10:32:12] back to work now :(
[10:32:15] honestly I'd prefer that xD
[10:32:40] I'd give you the troll of the year award and move on
[10:34:10] on the bright side, now we have BGP monitoring on icinga
[10:34:36] and on grafana
[10:34:56] and you're a proficient pybal maintainer by now!
[10:35:08] yey.. and I learned a lot on BGP
[10:35:35] but I'm pretty angry right now /o\
[10:35:41] that was a hot potato ema was more than willing to pass off ;p
[10:35:53] (do google "hot potato routing" btw...)
[10:48:04] Traffic, Operations, Pybal, Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4037730 (Vgutierrez) Open>Invalid a:Vgutierrez
[11:02:46] Traffic, Operations, Pybal: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4037747 (Vgutierrez)
[11:09:06] Traffic, Operations, Pybal: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4037765 (Vgutierrez) p:Triage>Normal
[12:31:37] systemd strikes again? who needs logging they can trust anyways?
[12:33:25] omg just read the bug
[12:34:11] Traffic, netops, Operations, ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4037912 (BBlack) Oh makes sense, maybe the initial image install just has the v4 and RIPE has to configure the v6 during their bringup process?
[12:34:50] bblack: yes, I've seen this happen before with the other ones
[12:34:59] (only v4 in the provisioning image)
[12:35:47] Traffic, netops, Operations, ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4037931 (faidon) That is correct to my knowledge -- that was the case with our other anchors.
[12:35:48] so
[12:36:36] perhaps we should file a bug against systemd to ask to add to the log a line "" or something
[12:49:14] hmmm so level3 esams link is down and acked, and at a glance the morning esams fetch-failed spike is missing this morning?
[12:50:04] or at least, greatly minimized
[12:50:16] 2day esams fetchfail to compare yesterday and today:
[12:50:18] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-2d&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All
[12:51:03] lemme go peek at librenms and see if I understand the level3 thing right...
[12:52:55] oh hmm, level3 link did go down (what I noticed earlier in backscrolls), but it was already back online before the usual morning spike
[12:53:07] still, odd. maybe they fixed something :)
[12:53:35] the emails we got pointed to some errors
[12:53:39] or maybe flaps?
[12:53:48] but I haven't really looked at it honestly
[12:54:18] I looked at this at the network layer a few weeks back, hoping for some link or bgp state flapping or other anomaly around EU morning times
[12:54:49] I didn't see anything before, and this has been a problem going quite a while back. lately I've been operating on the assumption it's not a network-layer issue.
[12:55:15] but it is curiously much-better today after level3 link was outaged for a while and then turned back on...
[12:56:28] arzhel's email says the proximate cause of the L3 link going down was a physical incident damaging the fiber
[12:57:11] maybe just coincidence, I donno
[13:54:59] paravoid: that's already there according to the documentation
[13:55:28] paravoid: but of course it was filtered with "|grep bgp"
[13:55:55] oh, ouch
[13:56:18] it should say something like "Jan 9 09:18:07 server1 journal: Suppressed 7124 messages from /system.slice/named.service"
[13:57:50] now I have a new interview question O:)
[14:02:33] haha
[14:21:14] lovely: https://manpages.debian.org/jessie/systemd/journald.conf.5.en.html VS https://manpages.debian.org/stretch/systemd/journald.conf.5.en.html
[14:21:35] jessie: RateLimitInterval, stretch: RateLimitIntervalSec
[14:21:55] but stretch option still allows minutes, hours or whatever unit you want ¬¬
[14:54:49] Traffic, Operations, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4038250 (BBlack)
[14:56:27] Traffic, Operations, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4036653 (BBlack) Updated with actual target country lists above. Process and batching of this for actual turn-up work still TODO :)
[14:58:42] Traffic, Operations: WP Zero workarounds for eqsin - https://phabricator.wikimedia.org/T189250#4038254 (BBlack)
[15:20:23] issue booting up cp3034: https://phabricator.wikimedia.org/P6826
[15:22:07] it did boot fine after a powercycle though
[15:33:13] Traffic, Operations, ops-esams: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4038347 (ema)
[15:33:49] Traffic, Operations, ops-esams: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4038382 (ema) p:Triage>Normal
[15:35:24] it's been complaining for a while, right?
[15:36:27] yup
[15:47:36] Traffic, Operations, Pybal: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4038412 (Vgutierrez) pybal emits 854 messages during a restart in lvs1006. Also during a restart is when appears to log at its fastest rate, achieving almost 400 lines per second: ```v...
[15:58:41] Traffic, Operations, Pybal: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4038450 (Vgutierrez) number of lines logged on a restart it's directly proportional to the number of services configured, lvs1010 appears to be the pybal instance with more services co...
[15:59:12] moritzm: it would be terribly bad if we disable the rate limiting for pybal?
[15:59:57] let me put it this way
[16:00:08] if pybal is emitting so many log messages that systemd is having a problem with it
[16:00:21] that system isn't being particularly contributing to our availability at that point anymore anyway :P
[16:00:43] indeed
[16:01:43] plus how else would valentin get his revenge?
[16:01:43] vgutierrez: not at all
[16:03:46] ema: we should depool 3034 if it's behaving like that
[16:04:23] Traffic, Operations, ops-esams: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4038347 (BBlack) See also T183177 (why aren't we getting runtime icinga alerts when these happen, via EDAC?)
[16:08:53] Traffic, Operations, ops-esams: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4038496 (BBlack) Also, depooled for now, since we can't trust the uncorrected memory errors not causing production issues: `16:07 <+logmsgbot> !log bblack@neodymium conftool action : set/poo...
[16:09:30] bblack: yup, +1
[17:31:47] 8/win 28
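Note: as mentioned at 13:55, the suppression notice was hidden by the `|grep bgp` filter. A small Python sketch that pulls those notices out of unfiltered journal text might look like the following. The regular expression is modelled only on the example line quoted at 13:56 ("Suppressed 7124 messages from /system.slice/named.service"); the exact wording can differ between systemd versions, so treat the pattern and the `suppressed_totals` name as assumptions.

```python
import re
from collections import Counter

# Pattern based on the example quoted in the chat; actual journald
# wording may vary by version (assumption).
SUPPRESSED = re.compile(r"Suppressed (\d+) messages from (\S+)")


def suppressed_totals(lines):
    """Sum the number of suppressed messages per unit path."""
    totals = Counter()
    for line in lines:
        match = SUPPRESSED.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals
```

Feeding it the journal without any per-keyword filter is the point: a non-empty result for a unit suggests that its on-disk log and the journal can no longer be expected to match line for line.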