[08:55:58] hello hello
[08:55:59] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m
[08:56:09] there seem to be a ton more messages for rsyslog
[08:58:31] trying to check what is currently being sent to kafka
[08:59:55] I just commented the same on another channel
[09:00:06] "message":"PHP Notice: Undefined index: x2 in \/srv\/mediawiki\/wmf-config\/etcd.php
[09:00:09] ah yes okok
[09:00:13] I guess #databases :)
[09:00:14] but I am still finding the source
[09:00:23] oh
[09:00:24] yeah
[09:00:32] I am trying to check why it is happening
[09:00:38] As I deployed x2
[09:00:40] super, thanks a lot
[09:00:44] But I don't get what I have missed
[09:00:44] I was about to ping you
[09:01:23] is that happening on all nodes, or just one?
[09:01:36] almost all mw servers from what I can see
[09:01:43] yeah, it should be everywhere
[09:02:00] I went on logstash1010 and executed kafkacat -t rsyslog-notice -C -b localhost:9092
[09:02:21] I am not sure what else needs to be added to dbctl when deploying a new section there
[09:02:36] ah, this is the new mainstash section
[09:02:41] yeah, which is not in use
[09:02:45] I just deployed the config
[09:02:51] So it is available for use
[09:02:56] but it has no data or anything
[09:03:52] yes yes, makes sense
[09:04:17] does etcd.php need to be modified in some way?
[09:04:21] https://gerrit.wikimedia.org/r/c/operations/puppet/+/649890
[09:04:28] This is all I knew we had to do, and chris confirmed
[09:05:08] does it require some kind of "etcd" deployment/reload?
[09:05:44] I think so, yes
[09:05:49] take a look at etcd.php
[09:05:57] it mentions es1 -> es5, and x1
[09:05:58] maybe chris forgot to mention that
[09:06:00] <_joe_> no it does not
[09:06:13] <_joe_> I mean a reload
[09:06:14] elukey: but that is a comment on db-eqiad.php
[09:06:15] hey, _joe_ any help with this error?
[09:06:25] <_joe_> still in bed, but
[09:06:37] <_joe_> confctl --object-type mwconfig select 'scope=eqiad,name=dbconfig' get
[09:06:44] <_joe_> tells me x2 is correctly defined
[09:06:47] mmm
[09:06:49] marostegui: so I was reading
[09:06:50] // Mediawiki has many different names for the same external storage clusters in dbctl.
[09:06:53] // This translates from dbctl's name for a cluster to a set of Mediawiki names for it.
[09:06:56] $externalStoreNameMap = [
[09:07:01] then it must be something else
[09:07:32] marostegui: ah sorry, you are saying that the function is not called, okok
[09:07:39] <_joe_> It's possible you needed to add something to mediawiki-config
[09:07:39] (or maybe I missed something)
[09:07:57] all the mws complain about line 81 of that file
[09:08:00] that contains the map
[09:08:04] <_joe_> which file?
[09:08:14] /srv/mediawiki/wmf-config/etcd.php
[09:08:15] elukey: I am not fully sure about that, but in etcd.php there is indeed x1 but not x2
[09:08:26] But I don't see x1 on db-eqiad.php
[09:08:26] yes, this is why I was saying that x1 might need to be there
[09:08:33] I mean in etcd.php
[09:08:49] ah, yes
[09:08:52] the mediawiki error is "Undefined index: x2 in \/srv\/mediawiki\/wmf-config\/etcd.php on line 81"
[09:08:54] This is what we have in the doc: https://wikitech.wikimedia.org/wiki/Dbctl#Add_new_core_section
[09:08:54] that is the cluster mapping
[09:09:08] that is an ancient thing that is definitely needed
[09:09:14] <_joe_> yes, you need to add x2 to that map
[09:09:19] probably documentation was skipped
[09:09:21] <_joe_> marostegui: that's for a core section
[09:09:27] marostegui, let's prepare a patch
[09:09:45] ok, let me send a patch
[09:09:49] <_joe_> you need to add a mapping for x2, yes
[09:10:11] <_joe_> if you follow the code, the issue is that you actually added the data to dbctl correctly
[09:10:29] <_joe_> so now you're cycling over a key that's not defined in that array
[09:10:31] <_joe_> hence the notice
[09:10:47] https://gerrit.wikimedia.org/r/658218 is this enough?
[09:10:53] users should not be affected, as nowhere in the code should extension2 be used at the moment
[09:11:10] marostegui: I know that you missed deploying mw-config
[09:11:14] hahaha
[09:11:15] marostegui, I think that should be it
[09:11:20] let me +1 it
[09:11:34] and if it works, we should amend the docs
[09:11:35] I will document this on the dbctl page later
[09:11:38] yep
[09:11:42] <_joe_> yes, it will work
[09:11:57] _joe_: do we need to manually reload etcd after merging+deploying?
[09:12:01] <_joe_> no
[09:12:06] <3
[09:12:08] <_joe_> definitely not
[09:12:27] <_joe_> in fact, the whole problem is that the etcd part is working correctly, but you're missing the mapping
[09:12:38] <_joe_> that code could use some love :)
[09:12:47] * _joe_ goes back to bed
[09:12:54] _joe_: thank you, get better
[09:15:04] deployed, let's see if the graph recovers
[09:17:08] I still see a lot of errors from kafka
[09:17:29] with a recent timestamp
[09:18:15] there was some lag detected, so let's give it a few minutes
[09:18:30] marostegui: I am on mw1382 and etcd.php didn't change
[09:18:32] I'm here too if needed on the logstash side
[09:18:38] elukey: :-/
[09:18:51] maybe a mistake on rebase or something?
[09:18:59] nope, I did the rebase and I saw my change
[09:19:36] Ah
[09:19:38] I know what it is
[09:19:42] ?
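For reference, here is a minimal PHP sketch of the pattern _joe_ describes above: etcd.php iterates over the external storage sections that dbctl publishes via etcd and translates each section name through $externalStoreNameMap, so a section missing from the map triggers the "Undefined index" notice. This is not the real wmf-config/etcd.php; the variable names, hostnames, weights, and most map entries are assumptions made up for illustration. Only the map's doc comment, the x2 -> extension2 mapping added in https://gerrit.wikimedia.org/r/658218, and the notice text come from the chat (x1 -> extension1 is assumed by analogy).

```php
<?php
// Hypothetical input: roughly the shape of the external storage section
// data dbctl publishes via etcd (section names from the chat, hostnames
// and weights made up).
$externalLoads = [
	'es4' => [ 'es1020' => 1, 'es1021' => 1 ],
	'x1'  => [ 'db1120' => 1 ],
	'x2'  => [ 'db1151' => 1 ],
];

// Mediawiki has many different names for the same external storage clusters in dbctl.
// This translates from dbctl's name for a cluster to a set of Mediawiki names for it.
// (Comment quoted from the chat; the entries are illustrative guesses, except that
// x2 -> extension2 is the mapping added in the gerrit patch above.)
$externalStoreNameMap = [
	'es4' => [ 'cluster26' ],
	'x1'  => [ 'extension1' ],
	'x2'  => [ 'extension2' ],
];

// The loop that produced the notice: each dbctl section name is used as an
// index into the map. Before the patch, 'x2' was missing from the map, so
// this lookup raised "PHP Notice: Undefined index: x2".
$clusterLoads = [];
foreach ( $externalLoads as $dbctlName => $loads ) {
	foreach ( $externalStoreNameMap[$dbctlName] as $mwName ) {
		$clusterLoads[$mwName] = $loads;
	}
}
```

With the 'x2' entry present the inner lookup succeeds and the notice goes away, which matches _joe_'s point that no etcd reload is needed; only the mapping in mediawiki-config was missing.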
[09:19:49] I am stupid: !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php
[09:20:00] :D
[09:20:01] I just have db-eqiad.php as an automatic reflex in my mind
[09:20:03] heh
[09:20:10] all right, found the trouble
[09:20:51] I think that is exactly why deployers are asked to test changes afterwards :-D
[09:21:14] deployed
[09:21:27] * jynus crosses fingers
[09:22:14] This is looking better now: https://grafana.wikimedia.org/d/000000561/logstash?viewPanel=29&orgId=1&refresh=30s&from=now-1h&to=now
[09:22:14] yep, I see the change on mw1382
[09:22:17] errors going down
[09:22:20] I wish covid graphs would be like this
[09:22:27] +1
[09:23:06] I don't see the error on kafka anymore either
[09:23:20] we should be happy about the things that went well - we had monitoring to catch it
[09:23:56] thanks everyone for the help, I am going to document this
[09:24:01] <3
[09:28:45] mmmm, message rate reduced to half, not sure if that is lag or another error still hitting
[09:30:29] that's likely logstash consuming the kafka backlog now
[09:31:27] ok
[09:31:49] https://grafana.wikimedia.org/d/n3yYx5OGz/kafka-by-topic?orgId=1&from=now-24h&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=logging-eqiad&var-kafka_broker=All&var-topic=rsyslog-notice
[09:31:57] funny thing, it is only affecting half of the servers on the list, as I mentioned on the other channel
[09:32:15] so traffic in for rsyslog-notice dropped, but indeed out is still high, as godog was saying
[09:55:26] interestingly, it increased a bit again
[09:55:34] and so did the dropped messages
[09:56:21] yeah, that was me bouncing apache2 on logstash1024, it was using a whole lot of cpu
[09:57:23] the kafka backlog should be cleared in the next hour I think
[09:59:48] cool, thank you
[15:10:25] andrewbogott, Jeff_Green: if you use the wmf-sre-laptop deb package ( https://wikitech.wikimedia.org/wiki/Wmf-sre-laptop ) or follow https://wikitech.wikimedia.org/wiki/Production_access#Known_host_files you can `ssh icinga.wikimedia.org` and it will go to the right one ;)
[15:13:03] I'm on an RPM distro, but I might be able to remember the cname! :-)
[15:13:53] the second link is distro-agnostic, works on macos too
[15:14:20] the cname per se is not enough because of strict checking of the host keys
[15:52:07] why are icinga* still around anyway? alert* replaced them four months ago?
[18:02:50] <_joe_> the k8sh that you've seen in the presentation is this silly little thing: https://github.com/lavagetto/k8sh
[18:03:02] <_joe_> I just made it marginally less shameful over the weekend
[18:03:23] <_joe_> but it's very useful if you want to do some debugging inside kubernetes
[18:50:04] did we lose some connectivity between eqiad and codfw for some time? there were a few alerts going off and then recovering a few minutes ago
[19:03:55] jynus: can you be more specific?
[19:04:12] having less noise here
[19:04:25] or, you mean my last msg?
[19:04:36] yes, re: network connectivity
[19:04:37] we got an alert for alert2001
[19:04:42] if you are talking about the icinga meta-monitoring email re: alert2001, that is not network-related
[19:04:45] and a few job failures
[19:04:55] there's an unfortunate race condition that pops up from time to time
[19:04:55] let me find those
[19:07:51] there was some stuff at https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1
[19:08:09] plus the pc lag issues at the same time
[19:08:23] so I thought it could be inter-dc communication
[19:08:42] but if you know the metamonitoring alert has other causes, those other things could be something else
[19:09:03] that alert is quite noisy
[19:09:13] sorry, no problem, I just didn't know
[19:10:17] from my perspective, I hadn't seen it fail since July, but I may be missing the extra alerts
[19:10:39] oh, I see, I was reading the month wrongly
[19:10:49] 7/12 not 12/7
[19:10:59] my fault
[19:16:40] the other hint is that it fails at approx :34 past the hour
[19:16:57] which is around when the alert1001->alert2001 rsync runs, and puts alert2001 into an inconsistent state until icinga restarts there
[19:17:05] usually it is quick enough for metamonitoring to not notice
[19:18:16] oh, nice! I didn't know about that
[22:52:06] interesting -- it keeps flapping on the secondary host :/ I wonder if Icinga is just taking longer to start up due to some recent configuration change or something