[06:56:36] morning!
[06:57:31] I see some OSPF alarms in icinga, it seems for cr2-eqdfw to cr1-eqiad and cr3-knams.. Both links are GTT, so possibly maintenance on eqdfw side
[06:59:40] but I don't find anything scheduled, nor advertised as an incident report
[07:05:20] on cr2-eqdfw the interface's optics look sane
[07:05:21] Laser output power : 0.7010 mW / -1.54 dBm
[07:05:29] Laser receiver power : 0.2828 mW / -5.49 dBm
[07:06:32] same thing on the cr1-eqiad side
[07:06:35] (to compare)
[07:09:23] so I guess that the next step is to send an email to GTT's NOC to ask for some checks on their side?
[07:15:08] yup, that sounds right
[07:18:03] ok prepping something :)
[07:20:47] ah wait the circuit is the same, VPLS is mentioned
[07:21:24] mmm but it seems weird
[07:21:32] xe-0/1/3.23 up up Transport: cr3-knams:xe-0/1/5.23 (GTT, 680970, 110ms) [10Gbps VPLS]
[07:21:36] xe-0/1/3.12 up up Transport: cr1-eqiad:xe-4/2/2.12 (GTT, 680970, 36ms) [10Gbps VPLS]
[07:22:05] on netbox I have another one https://netbox.wikimedia.org/circuits/circuits/35/
[07:22:16] so probably the description on the router is not correct
[07:23:57] but maybe the netbox one is a different one for cr3-knams, checking
[07:24:31] xe-0/1/5.23 up up Transport: cr2-eqdfw:xe-0/1/3.23 (GTT, 680989, 110ms
[07:24:34] nono ok
[07:28:39] so interestingly, I don't see in the router's logs the interface going down
[07:28:57] but the OSPF session does, possibly due to BFD
[07:32:00] also I just noticed that we have a GTT link between cr1-eqiad and cr3-knams
[07:32:14] and now only knams shows 2 BFD alerts
[07:32:31] so the issue could be on the knams side as well
[07:33:21] but in knams we have https://phabricator.wikimedia.org/T240659
[07:34:47] that is now resolved of course
[07:41:48] I am going to grab another coffee and then I'll recheck logs :)
[07:49:07] so first occurrence on the knams side
[07:49:09] Feb 11 03:07:56 cr3-knams bfdd[15104]: BFD Session 208.80.153.217 (IFL 95) state Up -> Down LD/RD(17/25) Up time:12:28:48 Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
[07:51:48] on cr1-eqiad I don't find logs in that timeframe, weird
[07:53:33] first occurrence on cr2-eqdfw
[07:53:34] Feb 11 03:07:56 cr2-eqdfw bfdd[15088]: BFD Session 208.80.153.216 (IFL 84) state Up -> Down LD/RD(25/17) Up time:12:28:50 Local diag: NbrSignal Remote diag: CtlExpire Reason: Received DOWN from PEER.
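As an aside on the log correlation being done here: a rough, hypothetical Python helper for pulling BFD state transitions out of syslog from several routers so the timestamps can be compared side by side. The regex is modeled only on the two bfdd lines quoted above; it is not an existing tool, and other Junos message variants may need tweaks.

    import re
    import sys
    from datetime import datetime

    # Matches the bfdd lines quoted above, e.g.:
    #   Feb 11 03:07:56 cr3-knams bfdd[15104]: BFD Session 208.80.153.217 (IFL 95)
    #   state Up -> Down ... Local diag: CtlExpire Remote diag: None Reason: Detect Timer Expiry.
    BFD_RE = re.compile(
        r'^(?P<ts>\w{3} +\d+ +[\d:]+) (?P<router>\S+) bfdd\[\d+\]: '
        r'BFD Session (?P<peer>\S+) \(IFL \d+\) state (?P<from>\w+) -> (?P<to>\w+).*'
        r'Local diag: (?P<ldiag>\S+) Remote diag: (?P<rdiag>\S+) Reason: (?P<reason>.+)$'
    )

    def bfd_events(lines):
        """Yield (timestamp, router, peer, transition, reason) for each BFD state change."""
        for line in lines:
            m = BFD_RE.match(line.strip())
            if not m:
                continue
            # syslog timestamps carry no year; good enough for same-day correlation
            ts = datetime.strptime(m.group('ts'), '%b %d %H:%M:%S')
            yield ts, m.group('router'), m.group('peer'), \
                '{}->{}'.format(m.group('from'), m.group('to')), m.group('reason')

    if __name__ == '__main__':
        # Feed it concatenated syslog from both ends of the circuit; events on both
        # routers within a second or two of each other point at the transport in
        # between (i.e. the provider) rather than at either endpoint.
        for event in sorted(bfd_events(sys.stdin)):
            print(*event, sep='\t')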
[07:55:05] ok re-checked, and alarms for OSPF flapping for cr1 are present
[07:56:42] first log in there is
[07:56:43] Feb 11 05:15:04 re0.cr1-eqiad rpd[5239]: BGP_RESET_PENDING_CONNECTION: 91.198.174.251 (External AS 65003): reseting pending active connection
[07:56:58] so I guess that the problem has been on the knams side, will report it in the task
[07:57:16] Cc: XioNoX: (not sure if what I wrote makes sense :D)
[11:16:26] flapping again, but afaics the interface doesn't go down
[11:16:33] it is bfd flapping
[11:16:41] and causing the ospf alerts
[11:50:53] looking
[11:54:44] elukey: so 1/ the circuit is an evpn circuit, so the interface state is only between us and GTT
[11:55:44] 2/ BFD is only used by the routing protocols (BGP/OSPF)
[11:55:51] for fast failover
[11:57:13] yes yes, my point was that I didn't see the interface down as it happens sometimes when the link is down for maintenance
[11:57:27] but only that bfd was flapping, together with its friend OSPF
[11:57:46] so I haven't sent any email to GTT due to the outstanding bug/feature/issue on cr3-knams
[11:58:07]
[11:58:31] elukey: so here we have 2 options: either it's the same bug as https://phabricator.wikimedia.org/T240659, but previously BFD was going down on one side only (the opposite side of knams) and OSPF was staying up
[11:58:41] or it's a provider issue
[11:59:29] yep I had the same idea, but I wanted to wait for the network master before proceeding :D
[11:59:30] link down would be very unlikely because it means something is wrong between our router and their router, usually in the same DC
[11:59:47] ah ok
[12:00:01] but it could happen if say they were doing maintenance on the link no?
[12:00:06] (interface down)
[12:10:30] yes
[12:40:17] elukey: my guess here is a provider issue
[12:41:10] as it flaps, while the issue in T240659 is a one-time thing
[12:41:11] T240659: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659
[12:43:16] should we then contact GTT's NOC?
[12:45:21] elukey: I think it stopped 1h ago?
[12:45:28] if it keeps happening then yes
[12:47:08] XioNoX: sure it stopped but a follow-up would be nice, but let's wait for another occurrence
[12:47:24] if so we can collect data and send it to their NOC
[14:18:14] so I'm looking at pulling together global aggregates in prometheus for the authdns stuff
[14:18:42] I get the part that's in modules/role/manifests/prometheus/global.pp with the regex addition there to pull gdnsd_*
[14:18:53] the rules/ops.yaml part is less clear to me
[14:19:02] err it's modules/role/files/prometheus/rules_ops.yml
[14:19:38] is rules_ops just to pre-sum global stuff to make grafana more efficient?
[14:20:33] bblack: is that for the NTT form?
[14:20:44] no, just in general
[14:21:00] right now we only have per-DC authdns metrics in grafana, but not a global aggregation to look at
[14:21:16] it's been on my back burner of little things to remember for a while
[14:21:33] bblack: I tried to find how much pps/mbps we were doing per authdns IP, is that on an existing graph?
[14:22:19] bblack: that's correct yeah, pre-aggregate in rules_ops.yml and then the regexp in global.pp to pick up the aggregates
[14:22:59] so what I've got now is a grafana panel with a dropdown for the source https://grafana.wikimedia.org/d/Jj8MztfZz/authoritative-dns?orgId=1
[14:23:02] bblack: if you were following https://wikitech.wikimedia.org/wiki/Prometheus#Aggregate_metrics_from_multiple_sites please LMK if there's something missing
[14:23:07] and a dropdown to pick a subset of servers
[14:23:35] I just want picking one of the /globals to give me the broader aggregate across all sites, and I guess still with the server selector working
[14:23:50] do I need to go define a global aggregation for each individual stat?
[14:24:36] bblack: if there aren't "many" gdnsd metrics you can ask global prometheus to pull everything in, without aggregates; then the server selection will work too
[14:25:21] :)
[14:27:01] XioNoX: mbps isn't, pps could be roughly inferred from the query rate
[14:27:12] of course there's the normal host-level metrics too for things like mbps
[14:27:47] bblack: now, are dns1001/1002 doing authdns, or authdns1001 too?
[14:28:22] host level metrics also include rec-dns if dns1001/1002
[14:31:25] it's a little tricky, yes
[14:31:55] right now in eqiad and codfw, the authdnsNNNN box is doing the outside traffic, and dnsN00[12] are handling the backing for the rec-dns
[14:32:09] but in esams there's only dns300[12], and they're doing both jobs
[14:33:17] ok, so I can just look at authdns1001/2001 to get my numbers
[14:33:35] each is different
[14:34:06] really we want to leave lots of headroom before they alert anyways, I would just ignore the inclusion of the extra traffic and count it all the same
[14:34:18] (for the query rate I mean)
[14:34:58] e.g. in eqiad you can see that the query rate for authdns1001 is ~1K/s, but for dns1001 and dns1002 it's like 5-10/s
[14:35:15] so the recdns-induced query rate is basically negligible
[14:35:32] esams peaks around 1.5K/s
[14:36:28] so those are roughly pps numbers in (because DNS queries are very typically 1x UDP packet, the exceptions are statistically negligible)
[14:37:13] peaks are ~1.5K to ns2, ~1.0K to ns1, ~1.0K to ns0
[14:37:58] for the host-level graphs in mbps though, yeah, it's hard to separate the traffic
[14:38:25] in any case, for the NTT thing we're probably looking for some kind of injury threshold
[14:38:42] there's no reason we shouldn't be able to handle, say, ~50Kpps.
[14:39:05] maybe alert at 25K and high-sev at 50K for pps for authdns?
[14:42:35] looking at the host-level network stats in eqiad as an example though: the authdns1001 instance is staying under 250Kbps, while the dns100[12] instances are more like staying just under ~1.5Mbps
[14:43:22] so for the public filtering, say 1Mbps is pretty good headroom to even alert, and high-sev maybe more like 10Mbps?
[14:43:34] bblack: eh, those are the thresholds I put so far
[14:43:39] perfect :)
[14:44:43] godog: I was trying to figure out what "many" meant re: metrics aggregation... what we have is ~25 metrics per server, and 12 total servers (2 in each edge site, 3 in each core)
[14:44:54] is that going to blow something up if I just aggregate it without rules?
[14:45:10] bblack: their default for Mbps is 25/50Mbps
[14:45:18] that seems fine too
[14:45:23] ok, cool
[14:45:32] I emailed them for clarification on some of the terms
[14:45:37] either way we should be able to handle that, or we have internal work to do
[14:45:38] thanks!
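To make the qps-to-pps reasoning above concrete, a small sketch of pulling per-site aggregates out of Prometheus over its HTTP API and comparing them with the 25K/50K pps thresholds being discussed. The endpoint URL and the gdnsd_udp_requests_total metric name are placeholders invented for the example (the real metrics are just whatever matches the gdnsd_* regex mentioned above), so treat it as an illustration of the API call, not the actual config.

    import requests

    # Both of these are placeholders, not real WMF values.
    PROM_URL = 'http://prometheus.example.org/global/api/v1/query'
    QUERY = 'sum by (site) (rate(gdnsd_udp_requests_total[5m]))'  # hypothetical metric name

    ALERT_PPS = 25_000     # warn threshold proposed above
    CRITICAL_PPS = 50_000  # high-sev threshold proposed above

    def authdns_qps_by_site():
        """Return {site: queries/sec} from the aggregated gdnsd metrics."""
        resp = requests.get(PROM_URL, params={'query': QUERY}, timeout=10)
        resp.raise_for_status()
        return {r['metric']['site']: float(r['value'][1])
                for r in resp.json()['data']['result']}

    if __name__ == '__main__':
        for site, qps in sorted(authdns_qps_by_site().items()):
            # Auth DNS queries are almost always a single UDP packet, so qps ~= pps.
            level = 'CRIT' if qps >= CRITICAL_PPS else 'WARN' if qps >= ALERT_PPS else 'ok'
            print('{}: {:.0f} q/s ({})'.format(site, qps, level))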
[14:46:00] yeah the goal is to have them trigger only before we can't handle it
[14:46:28] also as said in my email they're only 1 of the 3 or 4 pipes to each DC
[14:46:37] yeah there's some open philosophical questions I think, that we should figure out an answer to, about that.
[14:46:49] about link saturation vs "we can handle it", etc
[14:47:10] it would be convenient to say: we need external help if links are saturated, but otherwise we should be able to handle it inside
[14:47:40] which sort of implies we size our critical edge-facing things (e.g. cache clusters, lvs, authdns) to handle the full load our transits can possibly deliver in aggregate
[14:47:55] bblack: yeah that's fine, "many" in this context I'd say is >3-5k
[14:48:13] XioNoX: anyways, maybe deeper discussion on this belongs elsewhere
[14:48:51] yep :)
[17:55:52] unimportant nitpicking, but the double fact sync on first puppet run(s) still bugs me
[17:56:13] the lib/puppet/lib/puppet re-pathing or whatever it is
[18:07:23] bblack: which one?
[18:08:46] basically the very first (truly very first, even if it's automated) puppet agent run on a host syncs all the facts code to one path under /var/lib/puppet/....
[18:08:57] and then the next run does it all over again, to a slightly different sub-path
[18:09:15] it's like /var/lib/puppet becomes /var/lib/puppet/lib/puppet , for this purpose
[18:09:43] esp on far-flung edge hosts, it adds a lot of wasted time re-syncing them all again
[18:10:35] AFAIK those two dirs have different content
[18:10:59] unless we override some default and those get moved from one path to the other
[18:11:23] I don't have a log to look at, I can capture it on the next node I do
[18:11:42] but I think it's all the same facter code rb files, going into one path the first time then another path the second time
[18:12:18] # find /var -name etcd_role.rb
[18:12:27] /var/cache/puppet/lib/puppet/type/etcd_role.rb
[18:12:31] /var/lib/puppet/lib/puppet/type/etcd_role.rb
[18:12:33] maybe this?
[18:13:30] also, given we're on topic... I see you're doing them manually, what's missing from the reimage script for those cases?
[18:14:14] I just did this one manually so I could watch it. I think the next I'll try auto
[18:14:50] it's a hardware change, we're adding new 10G NICs, disabling the old onboard port, updating the mac, reimage
[18:14:51] oh nice, I was already fearing something to add to the TODO ;)
[18:15:05] but I think it would handle it fine, I just wanted to watch the first one manually
[18:15:20] makes total sense
[18:33:06] apergos: when is the train going?
[18:33:22] later during the scheduled window i guess, in 90 min or so
[18:33:46] Cool
[18:34:16] apergos: is that g2 to .18 as well
[18:34:28] no, that's in a few minutes
[18:35:34] Nice
[18:35:46] Hopefully I won’t be watching it go boom!
[19:04:05] volans: asynchronously - I did have a question about a different edge case for wmf-auto-reimage: what if we want to reimage a host from private-subnet to public-subnet (basically move from .eqiad.wmnet to .wikimedia.org with an IP change in DNS and a switch port change). Do we basically downtime and move the DNS and switch port, then launch reimage using the new hostname, or?
[19:04:52] volans: I kind of assume we'd have to manual-clean puppet and then do a --new reimage with the new name
[19:05:27] let me think one sec about the details
[19:08:11] so, the reimage script supports rename while reimage, it's documented on wikitech, as long as both old and new mgmt records exist that's ok.
In this case the mgmt record would stay the same if you just change domain and not hostname
[19:08:25] the part that is kinda missing is a pause in the middle to allow for the physical re-cabling
[19:09:03] but would be easy to add
[19:09:49] as an immediate workaround, if you pass the conftool option to depool, it waits 3 minutes after the depool before rebooting into PXE; if that's enough for the re-cabling it might work ;)
[19:09:53] bblack ^^^
[19:10:51] https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging
[19:12:13] it's not even physical recabling necessarily, just switching the vlan setting on the port config
[19:12:54] oh I didn't know wmf-auto-reimage had --rename
[19:13:09] oh, "wmf-auto-reimage-host"
[19:13:24] got it
[19:13:33] that's the one for a single host, wmf-auto-reimage is the wrapper for multiple at the same time
[19:13:49] yeah I always have just used the wrapper, even when it's just one :)
[19:14:00] tbh that path is not tested often as renames are rare, but IIRC Luca has done more than one and kept it up to date
[19:16:12] * volans should really find the time to complete its migration to a cookbook... :(
[19:17:49] elukey: re: kraz replacement. want a star name (misc codfw) or create a new cluster name, irc2001 ?
[19:18:18] non-cluster names are deprecated AFAIK
[19:21:27] hmm "Any system that runs in a dedicated services cluster with other machines "
[19:21:49] but it's just one.
[19:22:07] oh well, i will upload a DNS change with a cluster name
[19:47:00] mutante: yes I think irc2001 is fine!
[19:47:23] elukey: ACK, that's what is in Gerrit so far
[19:50:30] since it is time-sensitive due to the buster deprecation, I'd say that we could go forward for now
[19:50:44] err jessie deprecation :)
[19:51:10] I am not sure if eventually the service will be hosted in Kubernetes etc..
[19:51:27] but for now we need to test it properly on something similar to kraz :)
[19:51:40] (going afk, will read later in case)
[19:53:39] elukey: yea, moving it forward. jessie deprecation is my OKR as well
[19:53:57] super thanks!
[19:59:39] robh: any idea where T232649 is? Looks like those servers were expected to arrive January 7th...
[20:00:55] no idea, and i dont see anything stating it landed
[20:01:02] the vendor pushed it to end of january before
[20:01:05] me neither :)
[20:01:08] so just fired off an email asking wtf
[20:01:13] so we'll know soon
[20:01:19] cool! thanks!
[20:01:46] welcome, sorry nothing more definitive
[20:01:50] but we'll know within a day or so
[20:01:53] hopefully
[20:03:03] Ok, all of esams pdus have their power outlets labeled and reboot groups setup for all non-server devices.
[20:03:26] and power mappings in netbox. the only cable related fix for netbox i know of this second (will check reports shortly) is ulsfo dupes
[20:03:29] \o/
[20:09:24] fyi, I depooled eqsin and will start the cr1-eqsin software upgrade in 50min
[20:11:42] ack
[21:07:18] I'm here, pagestorm :(
[21:07:37] * volans too
[21:08:41] XioNoX: you around?
[21:08:48] we had the router upgrade
[21:08:52] volans: yep
[21:08:52] I'm here if needed but about to go to sleep
[21:08:55] and then we had "repool cp servers" in eqiad
[21:09:04] and everything seems to be down in eqiad
[21:09:06] any objections to depool eqiad??
[21:09:14] but ulsfo seems working
[21:09:31] cdanis: +1
[21:10:57] no objection for me, bblack do you have any?
[21:12:33] which channel are we coordinating on?
[21:17:22] here or operations is better than -security which is a non-public channel
[21:17:29] yeah let's do here
[21:17:36] agreed
[21:18:10] so, I couldn't load logstash or grafana myself (and chrome didn't re-read my /etc/hosts immediately when I put some names there), and I saw multiple +1s to my suggestion to depool, so I did it
[21:18:25] now that I can load logstash, I see a lot of errors in varnish5xx from cp1089
[21:19:11] it is cp1089 that just got reinstalled and repooled
[21:19:23] no, it shouldn't
[21:19:23] until about a minute ago icinga had trouble talking to 443 on cp1089 but now it's fine again
[21:19:29] no, it shouldn't be 1089
[21:20:05] 108[45] were down earlier for firmware, and had just finished and rob had moved to 108[67]
[21:20:27] cp1089 has a bunch of stuff in journal from traffic_manager about ERROR: HTTP/2 session error client_ip= session_id= closing a connection, because its stream error rate (2.000000) is too high
[21:21:02] yeah
[21:21:04] also various varnishkafka errors, varnishospital log overruns
[21:21:05] eqiad is the problem
[21:21:11] if there are already enough people around I'd rather log off
[21:21:17] go ahead
[21:21:29] me too, I was about to go to bed
[21:21:30] somehow maintenance work in eqiad has really gone off the rails, because like everything is depooled
[21:21:47] bblack: you mean LVS-depooled?
[21:21:49] oh
[21:21:53] no I mean confctl depooled
[21:21:59] 1089 is the only text node still in service :p
[21:22:03] (in eqiad)
[21:22:03] same, if it doesn't look like an ongoing issue
[21:22:07] Poor 1089
[21:22:19] call me if I'm needed
[21:22:21] ongoing may not be the right word
[21:22:35] I need to dig through SAL and figure out when/why each was
[21:22:37] I am here but only in lurk mode
[21:22:50] but probably repools are not happening correctly
[21:23:01] there were buster upgrades and rob's firmware upgrades
[21:23:06] shdubsh: ^
[21:23:07] im still around if needed
[21:23:31] On phone
[21:24:15] yeah, one cp_text pooled (1089), two cp_uploads pooled (1088/1099)
[21:24:52] traffic throughput, latency look normal
[21:24:58] cya
[21:25:20] fyi, first alert was at 21:05 UTC
[21:25:44] sounds like there is a bug somewhere that made it depool all servers instead of a single one
[21:25:47] issue start 21:03, end 21:17
[21:25:56] based on perf degradation metrics
[21:26:07] no
[21:26:15] we've got a handle on what happened though, see also -traffic
[21:26:15] jynus: end 21:17 lines up with 5 minutes after authdns-update, ty
[21:26:26] bblack: the based is important
[21:26:34] not saying it is over
[21:26:50] the based?
[21:27:02] start and end of latency issues
[21:27:19] not saying root cause has been fixed
[21:27:41] I'm not following you
[21:27:55] but I know what happened with the pages and eqiad traffic meltdown, there's no mystery
[21:27:57] jynus: a root cause has been found :)
[21:28:21] I think cdanis understood me
[21:28:36] please translate, going to sleep
[21:29:00] 0:-)
[21:29:38] bblack and all, are we good enough so I can reboot cr1-eqsin?
(eqsin is depooled) :)
[21:30:16] XioNoX: yes, go ahead
[21:31:17] I'm just double-checking things to make sure none of these are depooled for good reasons
[21:31:26] bblack: thanks
[21:31:32] (as in, were already depooled for unrelated reasons)
[21:32:32] eqiad still at 1.4k req/s, super persistent clients
[21:34:04] note there are two log entries in https://tools.wmflabs.org/sal/ not also in https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:17] *at least two, 21:11 and 21:13
[21:34:29] I'm backfilling there and in the doc but be warned that apparently that's a thing
[21:34:31] rlazarus: https://phabricator.wikimedia.org/T244766
[21:34:48] that's a separate issue I think
[21:34:53] oh. ok
[21:35:08] oh, yea. what you are describing is "stashbot is only in eqiad"
[21:35:12] yeah
[21:35:33] yeah, I know my 'depool eqiad' log message failed to write to wiki
[21:36:15] 👍 it's there now, because wiki
[21:37:01] ./topic because wiki
[21:40:23] https://wikitech.wikimedia.org/wiki/Incident_documentation/20200211-caching-proxies#Timeline
[21:41:24] I keep bouncing to the other channel sorry :)
[21:41:31] (-traffic in this case)
[21:41:51] so the minor issue we face now, is the lower-numbered ones that depooled out yesterday, likely aren't very warm
[21:42:24] but all things considered, I still think our best bet is a simple dns repool for eqiad
[21:42:42] just, we'll have to keep an eye on graphs, and possibly even wait through a minor disturbance in the force
[21:42:45] cdanis already uploaded the revert patch
[21:42:57] let me finish double-checking pybal state too
[21:43:05] please do!
[21:43:12] and if you want to +1 https://gerrit.wikimedia.org/r/c/operations/dns/+/571581 when you're done :)
[21:45:02] yeah pybal looks healthy, surprisingly!
[21:45:14] I'll take it
[21:45:16] (sometimes it gets very confused when things go under the threshold and it has no control over that)
[21:45:23] any objections to eqiad re-pool?
[21:45:33] let's do it
[21:48:40] so the hitrate's going to take a little to recover
[21:48:57] the minor worry here is we'll see excessive miss traffic and overwhelm some things on the inside, but so far so good
[21:50:13] my longer-term wishlist is that someday we move on to a more advanced data model for how depooling works, too
[21:50:27] but that might require whole new tooling, I don't even know
[21:50:59] you mean site-DNS-pooling, or confctl node/service pooling?
[21:51:36] the confctl side
[21:52:06] i can put on my confctl maintainer hat if you want 🙃
[21:52:27] but my gripe isn't with the tool, it's with the data model
[21:52:52] pooled=yes|no as a single-bit state (well sure there's a third state, but ignore that for now, it just complicated things pointlessly)
[21:53:29] really a better basic model is to have reasons as state, and be able to stack them
[21:53:59] so one person can do something like "confctl depool cp1088 'because-hardware-maintenance'"
[21:54:20] but someone else the day before also did "confctl depool cp1088 'because-funky-software-issue'"
[21:54:47] and these just push reasons onto an array of reasons it's ultimately in the not-pooled state for consumers like pybal (or others)
[21:55:04] and they all have to be undone by-name before it comes back to the normal pooled state
[21:55:30] mmm, yeah, I think I like that model.
[21:55:39] this isn't the issue here, but it's one that bugs me.
it's easy to miss right now if you depool for a very short reason X today, but it was already depooled for a longer reason before you started
[21:56:03] I think a proper commit history (something better than SAL) would be nice as well
[21:56:28] the new thing that occurred to me today, is it would be nice to also put an expected-lease-time on those to help with catching issues, or something
[21:56:29] it's something I've very much wanted for dbctl, where the objects involved and the edits users are making to them are more complicated than confctl node objects.
[21:57:07] like "depool cp1088 'because-hardware-maint-TNNNNN' expected-time=4h"
[21:57:31] or "depool 'auto-depool-on-shutdown' expected-time=1h"
[21:57:33] whatever
[21:57:47] and then some automated alert for expected-lease-time-expired-too-long-ago ?
[21:57:50] so that if a reason stays there longer than expected, that fires an alert or requires some remediation or action
[21:58:20] kinda like icinga downtimes
[21:58:34] maybe the two could even be linked somehow...
[21:58:40] linking the two would be nice.
[23:07:31] (also, sorry, didn't realize you both were here when I asked in -cloud-admin :)
[23:45:55] bblack: still around?
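For illustration only, a minimal Python sketch of the stacked-reason depool model discussed above, with the expected-lease-time idea bolted on. This is not how confctl/conftool actually stores node state; every name here (DepoolReason, NodeState, overdue, ...) is invented just for the sketch.

    from dataclasses import dataclass, field
    from datetime import datetime, timedelta
    from typing import List, Optional

    @dataclass
    class DepoolReason:
        name: str                              # e.g. 'because-hardware-maint-TNNNNN'
        added: datetime
        expected: Optional[timedelta] = None   # e.g. timedelta(hours=4), None = open-ended

    @dataclass
    class NodeState:
        """A node is pooled only when no depool reasons remain stacked on it."""
        reasons: List[DepoolReason] = field(default_factory=list)

        @property
        def pooled(self) -> bool:
            return not self.reasons

        def depool(self, name: str, expected: Optional[timedelta] = None) -> None:
            self.reasons.append(DepoolReason(name, datetime.utcnow(), expected))

        def repool(self, name: str) -> None:
            # Each reason has to be lifted by name; the node comes back only once all are gone.
            self.reasons = [r for r in self.reasons if r.name != name]

        def overdue(self, now: Optional[datetime] = None) -> List[DepoolReason]:
            """Reasons past their expected lease time -- alert-worthy, much like stale icinga downtimes."""
            now = now or datetime.utcnow()
            return [r for r in self.reasons
                    if r.expected is not None and now - r.added > r.expected]

    # Two overlapping depools: lifting only the hardware one leaves the node depooled,
    # because the earlier software-issue reason is still stacked.
    cp1088 = NodeState()
    cp1088.depool('because-funky-software-issue')
    cp1088.depool('because-hardware-maint-TNNNNN', expected=timedelta(hours=4))
    cp1088.repool('because-hardware-maint-TNNNNN')
    print(cp1088.pooled)  # False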