[07:41:21] elukey: there's been a critical in icinga for 3 days, "Stale file for node-exporter textfile in eqiad" -> cluster=analytics file=nic_firmware.prom instance=analytics1030 job=node site=eqiad. Known?
[07:42:12] ema: yes I am running some horrible tests on the hadoop test cluster, if you could ack those I'd be grateful, otherwise I'll do it later on
[07:44:01] elukey: ack'ed, thanks!
[08:36:05] etcd1003 has been failing its backup several times in a row: https://phabricator.wikimedia.org/P11854
[08:37:02] https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&from=1593765412147&to=1594370212147&var-site=eqiad&var-job=etcd1003.eqiad.wmnet-Monthly-1st-Fri-production-etcd
[08:38:24] I guess it is just the host being unavailable
[08:40:00] jynus: yeah, moritzm mentioned some ongoing memory replacement on #-operations I think?
[08:40:37] jbond42: o/ - qq - is memcached 1.6.x going to be maintained etc.. for idp? If so, it would be great to think about using it for mediawiki as well, as opposed to using 1.5.x (the buster version)
[08:41:50] elukey: that is the intention yes (tag moritzm just in case)
[08:42:18] I think I should implement a local ignore list so I can acknowledge failures for an alert, but only partially (for known broken hosts)
[08:43:58] as even if we wanted to remove hosts from backup rotation, we couldn't as we cannot run puppet on them
[08:55:44] elukey, jbond42: yeah, we'll either cherrypick security fixes to 1.6.6 or follow 1.6.x releases, we'll see
[08:56:09] if it's helpful for the mc* hosts, let's do that
[08:57:15] jynus: indeed, etcd1003 is affected by the ganeti1007 memory expansion this week, its disks are local to ganeti1007, so as long as ganeti1007 is down, it is as well
[08:58:16] don't worry, my complaint is more towards icinga being inflexible
[08:58:35] I do >100 checks, which it doesn't make sense to put on individual alerts
[08:58:51] but then that means I cannot ack/downtime those individual checks
[08:59:04] so I have to implement a pseudo-ack in the check code
[08:59:13] (unless someone suggests a better way)
[09:13:34] moritzm: the big benefit would be to unlock the extstore capabilities (https://memcached.org/blog/nvm-caching/), that might be something that we want to do in the future
[09:14:16] (maybe buying some nvme ssds on mc nodes when refreshing)
[09:20:21] and finally someone would take our memcached bug reports seriously since they are coming from the latest release :-)
[10:03:39] yes I already lost my credibility with the upstream person handling memcached, since I have been saying for a year that we'd upgrade to 1.5 :D
[14:58:13] would anyone have some time for a series of small cleanup patches for the contint Apache config? ;)
[15:29:30] hashar: sure, link?
[15:30:01] shdubsh: i have a small series starting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/607075/ ;)
[15:30:16] it is merely cleanup for old legacy stuff that we no longer have any use for
[15:30:51] oh and Hi ;]
[15:32:56] the next one switches to another directory for DocumentRoot in favor of using a scap deployed repository instead of having puppet git pull constantly ( https://gerrit.wikimedia.org/r/c/operations/puppet/+/607076/ )
[15:33:09] and the next two convert templates to plain raw files
[15:38:44] Looks pretty straightforward to me. +1
[15:39:38] shdubsh: would you mind doing the merge / puppet merge dance for me please? ;)
[15:40:06] I can run-puppet on the hosts though, but lack the merge rights
[15:44:03] ack, ok
[15:49:04] looks good on contint2001 :]
[15:55:13] Error 500

Invalid context.

[15:55:14] :-\
[15:57:00] shdubsh: the last patch broke the site https://integration.wikimedia.org/
[15:57:09] cause our PHP entry point does some check against DOCUMENT_ROOT bah
[15:57:24] hashar: revert?
[15:57:27] * shdubsh stages revert
[15:57:33] yeah my bad sorry :-\
[15:57:42] I tested a lot of things
[15:57:52] but never tried with the actual index.php code we are using :-\\
[15:58:57] shdubsh: sorry :\
[15:59:43] easy fix :)
[16:00:07] I spent a bunch of time verifying all the paths and the puppet catalog
[16:00:20] but didn't take the extra step of verifying the index.php code bah
[16:00:25] if ( !$path || !isset( $_SERVER['DOCUMENT_ROOT'] ) ) {
[16:00:25] self::error( 'Invalid context.' );
[16:00:30] stuff like that
[16:00:59] it's back again
[16:01:12] or maybe because php-fpm has to be restarted again
[16:01:18] anyway I gotta set it up locally and try ;]
[16:01:33] shdubsh: thank you!
[16:02:13] happy to help :)
[16:17:59] Sorry if I'm interrupting but does anyone know what to modify in icinga to add the page word to alerts that come to irc?
[16:32:48] RhinosF1: it is a part of the monitoring::service define. if $critical is true, it becomes a paging alert
[16:46:41] shdubsh: so in the end our code did not take into account that the DocumentRoot was a symbolic link. I made a proposed fix but all of that will wait for after my vacation :]
[16:46:51] shdubsh: thank you for the assistance earlier!
[16:48:41] no problem, enjoy the vacation!
[16:49:23] thx!
[16:52:42] shdubsh: link?
[16:54:52] RhinosF1: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/monitoring/manifests/service.pp
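
Note on the DOCUMENT_ROOT breakage (15:57 to 16:46): the log only says that the PHP entry point checks something against DOCUMENT_ROOT and that the code did not account for the DocumentRoot being a symbolic link. The actual index.php check is not shown in full, so the following is only a rough Python sketch of how such a containment check can fail on a symlinked root and how resolving both paths avoids it. The function names and the `deployed` / `docroot` directory names are hypothetical.

```python
import os
import tempfile


def is_under_root_naive(path, document_root):
    """String-prefix containment check; symlinks are NOT resolved."""
    root = document_root.rstrip(os.sep) + os.sep
    return path.startswith(root)


def is_under_root_resolved(path, document_root):
    """Resolve symlinks on both sides, then compare real paths."""
    real_root = os.path.realpath(document_root)
    real_path = os.path.realpath(path)
    return os.path.commonpath([real_root, real_path]) == real_root


def demo():
    """Build a symlinked document root and run both checks against it."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "deployed")  # hypothetical real directory
        os.mkdir(target)
        link = os.path.join(tmp, "docroot")     # hypothetical symlinked DocumentRoot
        os.symlink(target, link)
        script = os.path.join(target, "index.php")
        open(script, "w").close()
        # Assumption: the script knows its resolved path while the
        # web server reports the symlink as the document root.
        return (is_under_root_naive(script, link),
                is_under_root_resolved(script, link))
```

Under these assumptions `demo()` returns `(False, True)`: the prefix comparison rejects a file that really lives under the document root, while the resolved comparison accepts it. Whether the proposed fix mentioned at 16:46 takes this approach is not stated in the log.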