[01:03:32] <wikibugs>	 Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4037248 (ayounsi)
[01:03:35] <wikibugs>	 Traffic, Operations: Network hardware configuration for Asia Cache DC - https://phabricator.wikimedia.org/T162684#4037246 (ayounsi) Open>Resolved Devices added to Rancid & monitoring  We're all done here.
[01:04:14] <wikibugs>	 Traffic, Operations: Enable Service in Asia Cache DC - https://phabricator.wikimedia.org/T156026#4037252 (ayounsi)
[01:04:16] <wikibugs>	 Traffic, Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#2962044 (ayounsi) Open>Resolved a:ayounsi Transit, Transport, and Peering are up.
[01:52:47] <wikibugs>	 Traffic, Operations, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4037302 (Krinkle)
[03:36:13] <wikibugs>	 Traffic, Operations, wikitech.wikimedia.org, Patch-For-Review, cloud-services-team (Kanban): How to purge misc-web varnishes for wikitech changes? - https://phabricator.wikimedia.org/T189168#4037359 (Andrew) Open>Resolved
[04:45:56] <wikibugs>	 Domains, Traffic, Operations, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (Prtksxna) Requested {T189279} too.
[04:53:17] <wikibugs>	 Domains, Traffic, Operations, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (Gryllida) Jumping out of context here, but it could be nice to have the new site multi-lingual unlike what https://wiki...
[04:56:52] <wikibugs>	 Domains, Traffic, Operations, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#4037419 (Volker_E) @Gryllida That is one of our own quests and is discussed in T164449. Please don't side-rail tasks, but rather...
[08:05:07] <wikibugs>	 Traffic, Operations, Pybal, Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4037541 (Vgutierrez) Checking cr2-eqiad BGP neighbor information, I realized that for lvs1006 it's showing an Open Message Error tha...
[09:35:01] <wikibugs>	 Traffic, Operations, Pybal, Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4037641 (Vgutierrez) rechecking logs on lvs1006.wikimedia.org shows the following output regarding bgp for Feb 22nd: ``` vgutierrez@...
[09:35:08] <vgutierrez>	 oh god...
[09:35:14] <vgutierrez>	 I feel like an idiot :/
[09:36:46] <vgutierrez>	 but I think that we can close that now
[09:47:41] <ema>	 so what we're saying is that `journalctl -u pybal` and pybal.log differ, basically
[09:47:55] <vgutierrez>	 yup
[09:47:56] <ema>	 which is like /o\
[09:48:11] <vgutierrez>	 a lot
[09:48:13] <ema>	 ffs
[09:51:39] <vgutierrez>	 ema: so.. shall we close T188085?
[09:51:40] <stashbot>	 T188085: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085
[09:52:45] <mark>	 close how?
[09:52:49] <mark>	 how is it resolved?
[09:53:08] <vgutierrez>	 there wasn't an issue to resolve
[09:53:19] <vgutierrez>	 at least not on pybal BGP implementation
[09:53:29] <vgutierrez>	 check https://phabricator.wikimedia.org/T188085#4037641
[09:54:03] <mark>	 aha
[09:54:06] <mark>	 we were just missing log lines
[09:54:15] <mark>	 ok hehe
[09:54:22] <vgutierrez>	 right :_(
[09:54:51] <mark>	 yay systemd?
[09:58:31] <mark>	 i guess we can close it as invalid ;p
[10:01:23] <ema>	 yay buffering, more likely
[10:01:26] <ema>	 https://serverfault.com/questions/832691/view-unbuffered-log-output-from-journalctl
[10:17:23] <vgutierrez>	 I don't get it, buffering makes some messages go away?
[10:18:02] <vgutierrez>	 Feb 22 09:06:30 lvs1006 pybal[26025]: [bgp] INFO: State is now: OPENSENT
[10:18:05] <vgutierrez>	 Feb 22 09:13:12 lvs1006 pybal[26025]: [bgp] INFO: State is now: IDLE
[10:18:11] <vgutierrez>	 from 09:06:30 there are several ones missing
[10:18:47] <ema>	 oh, some are *missing* altogether
[10:18:50] <ema>	 sorry I didn't get that
[10:19:30] <ema>	 how does all this relate to yesterday's "slow" vs "fast" pybal restarts?
[10:19:50] <vgutierrez>	 no relation at all
[10:20:42] <vgutierrez>	 pybal doesn't wait to establish the BGP session once it (pybal) is up
[10:20:52] <vgutierrez>	 so on fast restarts in can stress some BGP peers
[10:20:58] <vgutierrez>	 s/in/it/g
[10:22:03] <ema>	 I thought you said that quickly stopping/starting pybal was related to T188085, but I might very well misremember
[10:22:04] <stashbot>	 T188085: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085
[10:22:18] <vgutierrez>	 yes, I thought that yesterday
[10:23:05] <vgutierrez>	 and reading more pybal code I realised if that was the case, the connection should be closed (and logged as closed)
[10:23:12] <ema>	 ok
[10:23:23] <ema>	 meanwhile I've tried diffing today's pybal.log with journalctl -u pybal --since=today on lvs1006
[10:23:44] <vgutierrez>	 so I went to lvs1006 to read more pybal logs.. and I found this
[10:23:48] <vgutierrez>	 ema: and...?
[10:23:48] <ema>	 no lines missing, but a small minority were logged 1 sec later 
[10:25:26] <ema>	 the datetime format is a bit different, pybal.log has 'Mar  9 00:00:09 [...]' while journalctl has 'Mar 09 00:00:09 [...]'
[10:26:37] <ema>	 out of 7798 log entries, 17 were logged 1 second later on the journal compared to pybal.log 
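The comparison described above (same entries, different day padding, a few entries stamped one second later) can be sketched in Python. This is a minimal sketch and not what ema actually ran: the function names are hypothetical, the year is assumed to be 2018 because syslog-style timestamps omit it, and it assumes everything after the timestamp is identical in pybal.log and in the journalctl output.

```python
from collections import Counter
from datetime import datetime, timedelta


def normalise(line, year=2018):
    """Split a syslog-style line into (timestamp, rest of line)."""
    # split() treats a run of whitespace as one separator, so 'Mar  9'
    # (pybal.log) and 'Mar 09' (journalctl) parse to the same timestamp.
    month, day, clock, rest = line.split(None, 3)
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    return stamp, rest.strip()


def compare(file_lines, journal_lines):
    """Classify each file entry as same, one_second_later or missing in the journal."""
    journal = Counter(normalise(line) for line in journal_lines)
    result = Counter()
    for stamp, rest in (normalise(line) for line in file_lines):
        for label, delay in (("same", 0), ("one_second_later", 1)):
            key = (stamp + timedelta(seconds=delay), rest)
            if journal[key] > 0:
                journal[key] -= 1  # consume the match so duplicates pair one to one
                result[label] += 1
                break
        else:
            result["missing"] += 1
    return result
```

Called as `compare(open("pybal.log"), journal_output_lines)`, the returned counter gives the three totals; with the figures quoted above one would expect 17 under `one_second_later` and 0 under `missing`.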
[10:27:11] <moritzm>	 how much is logged for pybal? maybe the rate limits of journald kick in? see journald.conf(5)
[10:27:19] <moritzm>	 RateLimitIntervalSec and RateLimitBurst
[10:28:10] <vgutierrez>	  Defaults to 1000 messages in 30s
[10:28:35] <moritzm>	 are we reaching that?
[10:28:37] <vgutierrez>	 vgutierrez@lvs1006:~$ fgrep "09:06:30" pybal.log.15 |wc -l
[10:28:37] <vgutierrez>	 397
[10:28:54] <vgutierrez>	 400 messages in one second...
[10:29:24] <moritzm>	 so it's pybal's fault for not shipping a sane journald config :)
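To check whether a log burst would trip the limit discussed above, one can replay the timestamps from pybal.log against a simple model of the limiter. This is a rough sketch under stated assumptions: the defaults (1000 messages per 30s) are the ones quoted in the chat, the function names are hypothetical, the year is assumed, and the fixed-window model is a simplification of what journald really does.

```python
from datetime import datetime, timedelta


def parse_ts(line, year=2018):
    """Parse the syslog-style prefix, e.g. 'Feb 22 09:06:30 lvs1006 pybal[26025]: ...'."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def count_suppressed(timestamps, burst=1000, interval=timedelta(seconds=30)):
    """Count messages a fixed-window limiter would drop (simplified model)."""
    suppressed = 0
    window_start = None
    seen = 0
    for ts in timestamps:
        # Open a new window on the first message, or once the interval has elapsed.
        if window_start is None or ts - window_start >= interval:
            window_start = ts
            seen = 0
        seen += 1
        if seen > burst:
            suppressed += 1
    return suppressed
```

Usage would be `count_suppressed(parse_ts(line) for line in open("pybal.log.15"))`. Note that one second with about 400 messages does not exceed 1000 per 30s on its own; what matters is the total within the whole window.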
[10:30:16] <ema>	 mai una gioia ("never a joy"), as we say in .it
[10:30:24] <vgutierrez>	 xDDDDDDD
[10:31:03] <vgutierrez>	 I laugh.. otherwise I'd kill myself or something
[10:31:16] <vgutierrez>	 I've been chasing a ghost bug for 2 weeks
[10:31:55] <ema>	 yeah this was just part of the onboarding process
[10:32:00] <ema>	 we knew
[10:32:08] * mark puts away the popcorn
[10:32:12] <mark>	 back to work now :(
[10:32:15] <vgutierrez>	 honestly I'd prefer that xD
[10:32:40] <vgutierrez>	 I'd give you the troll of the year award and move on
[10:34:10] <vgutierrez>	 on the bright side, now we have BGP monitoring on icinga
[10:34:36] <mark>	 and on grafana
[10:34:56] <ema>	 and you're a proficient pybal maintainer by now!
[10:35:08] <vgutierrez>	 yep.. and I learned a lot about BGP
[10:35:35] <vgutierrez>	 but I'm pretty angry right now /o\
[10:35:41] <mark>	 that was a hot potato ema was more than willing to pass off ;p
[10:35:53] <mark>	 (do google "hot potato routing" btw...)
[10:48:04] <wikibugs>	 Traffic, Operations, Pybal, Patch-For-Review: Pybal stuck at BGP state OPENSENT while the other peer reached ESTABLISHED - https://phabricator.wikimedia.org/T188085#4037730 (Vgutierrez) Open>Invalid a:Vgutierrez
[11:02:46] <wikibugs>	 Traffic, Operations, Pybal: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4037747 (Vgutierrez)
[11:09:06] <wikibugs>	 Traffic, Operations, Pybal: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4037765 (Vgutierrez) p:Triage>Normal
[12:31:37] <bblack>	 systemd strikes again? who needs logging they can trust anyways?
[12:33:25] <paravoid>	 omg just read the bug
[12:34:11] <wikibugs>	 Traffic, netops, Operations, ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4037912 (BBlack) Oh makes sense, maybe the initial image install just has the v4 and RIPE has to configure the v6 during their bringup process?
[12:34:50] <paravoid>	 bblack: yes, I've seen this happen before with the other ones
[12:34:59] <paravoid>	 (only v4 in the provisioning image)
[12:35:47] <wikibugs>	 Traffic, netops, Operations, ops-eqsin: Setup eqsin RIPE Atlas anchor - https://phabricator.wikimedia.org/T179042#4037931 (faidon) That is correct to my knowledge -- that was the case with our other anchors.
[12:35:48] <paravoid>	 so
[12:36:36] <paravoid>	 perhaps we should file a bug against systemd to ask to add to the log a line "<rate-limit hit, truncated>" or something
[12:49:14] <bblack>	 hmmm so level3 esams link is down and acked, and at a glance the morning esams fetch-failed spike is missing this morning?
[12:50:04] <bblack>	 or at least, greatly minimized
[12:50:16] <bblack>	 2day esams fetchfail to compare yesterday and today:
[12:50:18] <bblack>	 https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&from=now-2d&to=now&var-datasource=esams%20prometheus%2Fops&var-cache_type=text&var-server=All
[12:51:03] <bblack>	 lemme go peek at librenms and see if I understand the level3 thing right...
[12:52:55] <bblack>	 oh hmm, level3 link did go down (what I noticed earlier in backscrolls), but it was already back online before the usual morning spike
[12:53:07] <bblack>	 still, odd.  maybe they fixed something :)
[12:53:35] <paravoid>	 the emails we got pointed to some errors
[12:53:39] <paravoid>	 or maybe flaps?
[12:53:48] <paravoid>	 but I haven't really looked at it honestly
[12:54:18] <bblack>	 I looked at this at the network layer a few weeks back, hoping for some link or bgp state flapping or other anomaly around EU morning times
[12:54:49] <bblack>	 I didn't see anything before, and this has been a problem going quite a while back.  lately I've been operating on the assumption it's not a network-layer issue.
[12:55:15] <bblack>	 but it is curiously much-better today after level3 link was outaged for a while and then turned back on...
[12:56:28] <bblack>	 arzhel's email says the proximate cause of the L3 link going down was a physical incident damaging the fiber
[12:57:11] <bblack>	 maybe just coincidence, I donno
[13:54:59] <vgutierrez>	 paravoid: that's already there according to the documentation
[13:55:28] <vgutierrez>	 paravoid: but of course it was filtered with "|grep bgp"
[13:55:55] <paravoid>	 oh, ouch
[13:56:18] <vgutierrez>	 it should say something like "Jan  9 09:18:07 server1 journal: Suppressed 7124 messages from /system.slice/named.service"
[13:57:50] <vgutierrez>	 now I have a new interview question O:)
[14:02:33] <moritzm>	 haha
[14:21:14] <vgutierrez>	 lovely: https://manpages.debian.org/jessie/systemd/journald.conf.5.en.html VS https://manpages.debian.org/stretch/systemd/journald.conf.5.en.html
[14:21:35] <vgutierrez>	 jessie: RateLimitInterval, stretch: RateLimitIntervalSec
[14:21:55] <vgutierrez>	 but the stretch option still allows minutes, hours or whatever unit you want ¬¬
[14:54:49] <wikibugs>	 Traffic, Operations, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4038250 (BBlack)
[14:56:27] <wikibugs>	 Traffic, Operations, Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4036653 (BBlack) Updated with actual target country lists above.  Process and batching of this for actual turn-up work still TODO :)
[14:58:42] <wikibugs>	 Traffic, Operations: WP Zero workarounds for eqsin - https://phabricator.wikimedia.org/T189250#4038254 (BBlack)
[15:20:23] <ema>	 issue booting up cp3034: https://phabricator.wikimedia.org/P6826
[15:22:07] <ema>	 it did boot fine after a powercycle though
[15:33:13] <wikibugs>	 Traffic, Operations, ops-esams: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4038347 (ema)
[15:33:49] <wikibugs>	 Traffic, Operations, ops-esams: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4038382 (ema) p:Triage>Normal
[15:35:24] <vgutierrez>	 it's been complaining for a while, right?
[15:36:27] <ema>	 yup
[15:47:36] <wikibugs>	 Traffic, Operations, Pybal: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4038412 (Vgutierrez) pybal emits 854 messages during a restart in lvs1006. A restart is also when it appears to log at its fastest rate, achieving almost 400 lines per second: ```v...
[15:58:41] <wikibugs>	 Traffic, Operations, Pybal: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290#4038450 (Vgutierrez) the number of lines logged on a restart is directly proportional to the number of services configured, lvs1010 appears to be the pybal instance with more services co...
[15:59:12] <vgutierrez>	 moritzm: would it be terribly bad if we disabled the rate limiting for pybal?
[15:59:57] <mark>	 let me put it this way
[16:00:08] <mark>	 if pybal is emitting so many log messages that systemd is having a problem with it
[16:00:21] <mark>	 that system isn't particularly contributing to our availability at that point anymore anyway :P
[16:00:43] <vgutierrez>	 indeed
[16:01:43] <mark>	 plus how else would valentin get his revenge?
[16:01:43] <moritzm>	 vgutierrez: not at all
[16:03:46] <bblack>	 ema: we should depool 3034 if it's behaving like that
[16:04:23] <wikibugs>	 Traffic, Operations, ops-esams: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4038347 (BBlack) See also T183177 (why aren't we getting runtime icinga alerts when these happen, via EDAC?)
[16:08:53] <wikibugs>	 Traffic, Operations, ops-esams: cp3034: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T189305#4038496 (BBlack) Also, depooled for now, since we can't trust the uncorrected memory errors not causing production issues: `16:07 <+logmsgbot> !log bblack@neodymium conftool action : set/poo...
[16:09:30] <ema>	 bblack: yup, +1
[17:31:47] <paravoid>	 8/win 28