[08:54:17] <_joe_> cdanis: poolcounter is that thing that never breaks, luckily
[08:54:34] <_joe_> I think since I'm here it has had 100% availability or something
[08:54:40] <_joe_> (I just jinxed it)
[08:55:00] <_joe_> but yes, it's an extremely simple C daemon that exposes no metrics
[08:57:23] <_joe_> uhm I stand corrected
[08:57:27] <_joe_> it exposes its stats
[08:57:55] <_joe_> well I guess once I'm done writing the python poolcounter library, it's easy to create a prometheus exporter from it :)
[09:30:41] You can always use redlock and redis https://redis.io/topics/distlock
[09:44:04] * _joe_ larts fsero
[09:48:15] hey don't lart me, i've used it and it's not that bad
[10:07:15] <_joe_> must've gotten better lately
[10:59:00] MySQL difference has narrowed considerably: https://grafana.wikimedia.org/d/000000332/cluster-hardware-specs-differences?orgId=1&var-datasource=eqiad%20prometheus%2Fglobal&var-cluster=All
[10:59:29] there is still a last (10) batch purchase pending
[10:59:55] there is likely to always be a difference as lots of cloud and analytics will be single-dc
[11:00:50] 9 TB may seem extreme, but that is just 18 hosts (out of 150ish)
[11:01:26] interestingly, app servers seem to be much better on codfw?
[11:04:53] <_joe_> yes, we're rebalancing them next year
[11:05:47] I think lots of things will get better when in active-active status
[11:11:46] <_joe_> we'll see
[11:16:17] <_joe_> completely unrelated: poolcounter has two types of lock: 1 - you want to acquire a lock for yourself, then do work; 2 - you want any worker to acquire the lock and do the work
[11:16:36] <_joe_> I'm thinking of what's the best pythonic way to expose those two locks
[11:18:08] <_joe_> a contextmanager seems like a logical choice, but I'm not sure it fits the second case.
Basically I'd need the contextmanager to yield only if the lock has been acquired in scenario 1
[11:18:18] I read "the best platonic way to expose those two locks"
[11:18:51] <_joe_> in scenario 2, I'd need to return the lock status, and based on that let the user execute the code or not.
[11:21:15] Slightly unrelated https://github.com/Wikia/poolcounter-prometheus-exporter
[11:22:25] <_joe_> fsero: hah
[11:22:53] <_joe_> fsero: write a task to install and use it?
[11:23:06] And package it
[11:23:10] Sure :p
[11:23:18] <_joe_> "Metrics are made available at the /prometheus HTTP endpoint."
[11:23:23] <_joe_> also patch it
[11:23:25] <_joe_> :D
[11:24:32] <_joe_> we need to find some time to discuss packaging of go applications in dublin; it's probably most important to our team, but frankly I don't think following stapelberg's trail is particularly useful.
[11:24:54] <_joe_> that is, creating a package for every dependency and such
[11:25:00] yeah, let's have a session on that
[11:25:05] <_joe_> with go mod, things should be easier to manage, too
[11:25:06] I have my own opinion
[11:25:21] <_joe_> I guessed so, care to express it?
[11:25:22] Which is: use go 1.11, use go mod, and commit the vendored dependencies
[11:25:36] Packages are easy to manage that way
[11:25:40] <_joe_> yes, that's more or less where I was going too
[11:25:55] And it allows us to change the source in case of a hotfix
[11:26:11] <_joe_> what do you mean?
[11:27:08] We are packaging helmfile and using it.
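[Editor's note: the two lock scenarios _joe_ describes above could be exposed in Python roughly as below. This is only a sketch: the `PoolCounterClient` class is a hypothetical in-process stand-in (the real poolcounter daemon speaks a TCP protocol with ACQ4ME/ACQ4ANY/RELEASE-style commands), and the method names are assumptions, not taken from the log.]

```python
import threading
from contextlib import contextmanager

class PoolCounterClient:
    """Hypothetical stand-in for a poolcounter client, backed by a
    local threading.Lock so the sketch is runnable without the daemon."""

    def __init__(self):
        self._lock = threading.Lock()

    def acq4me(self, key, timeout=1.0):
        # Scenario 1: acquire the lock for ourselves, blocking up to timeout.
        return self._lock.acquire(timeout=timeout)

    def acq4any(self, key, timeout=1.0):
        # Scenario 2: True means *we* were elected to do the work;
        # False means another worker already holds the lock.
        return self._lock.acquire(blocking=False)

    def release(self, key):
        self._lock.release()

class LockTimeout(Exception):
    pass

@contextmanager
def for_me(client, key, timeout=1.0):
    """Scenario 1: yield only once the lock is held; raise otherwise."""
    if not client.acq4me(key, timeout):
        raise LockTimeout(key)
    try:
        yield
    finally:
        client.release(key)

@contextmanager
def for_anyone(client, key, timeout=1.0):
    """Scenario 2: always yield, but hand the caller the lock status so
    the caller decides whether to execute the code."""
    acquired = client.acq4any(key, timeout)
    try:
        yield acquired
    finally:
        if acquired:
            client.release(key)

client = PoolCounterClient()
with for_me(client, "render-page-42"):
    pass  # lock is guaranteed held here

with for_anyone(client, "regen-cache") as elected:
    if elected:
        pass  # we won the election, do the work
    # else: another worker is doing it; skip or wait
```

This keeps a contextmanager for both cases: scenario 1 never yields without the lock, while scenario 2 yields the acquisition status, matching _joe_'s "return the lock status, and based on that let the user execute the code or not".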
Best workflow is, in case of a bug, report and fix upstream
[11:27:42] <_joe_> so usually if the patch is done by me, what I do is the following:
[11:27:54] <_joe_> modify the code in our package repo
[11:28:02] <_joe_> build it, test it locally
[11:28:48] <_joe_> export name="something.patch"; git diff > debian/patches/$name; echo $name >> debian/patches/series
[11:28:58] <_joe_> dch and that's it
[11:29:07] <_joe_> I also have the patch ready for upstream
[11:29:17] <_joe_> ditto if we want to backport an upstream fix
[11:29:58] <_joe_> this way when we sync with upstream it's not a rebase hell
[11:30:08] <_joe_> does it make sense?
[11:30:16] Yep, committing the vendored deps also allows us to change dependencies. Imagine that a dep of our project got a security issue and upstream refuses to update because reasons. Having the vendored code committed allows us to hotfix dependencies in addition to the original source
[11:30:31] <_joe_> right
[11:30:40] <_joe_> instead of having to build two packages
[11:30:46] Yes
[11:30:50] <_joe_> one of which will never be installed anywhere
[11:31:14] <_joe_> we still have the problem of tracking security issues ofc
[11:31:54] <_joe_> but that stands more or less for the accepted debian way too
[11:32:20] We can have a small bot that looks in Gerrit for go.sum files and reports it to ¿Debmonitor?
[11:38:07] I can't find the reason why jenkins down-voted my patch: https://integration.wikimedia.org/ci/job/operations-puppet-tests-stretch-docker/14542/console
[11:39:53] 12:36:35 + tee /srv/workspace/log/rake.log
[11:39:54] 12:36:36 hieradata/hosts/cloudcontrol2001-dev.yaml:6: - prometheus2003.eqiad.wmnet
[11:39:54] 12:36:36 hieradata/hosts/cloudcontrol2001-dev.yaml:7: - prometheus2004.eqiad.wmnet
[11:39:54] 12:36:36 ---> syntax:manifests
[11:39:54] 12:36:36 rake aborted!
[11:39:54] 12:36:36 Typo found!
[11:39:55] 12:36:36
[11:39:57] arturo: ^
[11:40:09] ok, and where is the typo exactly?
[11:46:05] arturo: you have 200* hosts in eqiad
[11:46:26] oh....
[11:46:37] fyi you can run `bundle exec rake typos` which is sometimes easier to parse
[11:46:38] well spotted jbond42, that one was tricky
[11:46:56] i.e. it highlights the typo
[12:42:30] godog: thanks for editing the varnish50x logstash dashboard!
[12:52:13] np! hope that's useful
[13:44:41] godog: i think im seeing this issue with mtail at the moment https://github.com/google/mtail/issues/66
[13:44:56] i noticed metrics had stopped this morning here https://grafana.wikimedia.org/d/CVfSPeqiz/iptables?orgId=1&from=now-24h&to=now
[13:45:07] looks like it happened after logrotate
[14:18:33] godog: cdanis as part of the decommission of the old registry i need to delete the docker_registry swift container on codfw
[14:18:49] just a heads up in case you see deletes coming up
[14:24:10] ow jbond42 ! confirmed it is the same problem?
[14:24:16] fsero: ack, thanks for the heads up!
[14:26:03] godog: i have restarted mtail now but was seeing something similar to the following on both syslog servers
[14:26:07] Jun 4 13:50:29 lithium mtail[686]: I0604 13:50:29.945778 686 tail.go:186] read /srv/syslog/swift.log: file already closed
[14:40:18] ack jbond42
[16:07:47] is someone working on sessionstore.svc.eqiad.wmnet ? both codfw and eqiad are warnings for 3h - https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=4&hoststatustypes=3&serviceprops=2097162
[16:13:47] I think akosiaris ?
[16:15:17] yup, but I had missed that, with it being in warning
[16:21:07] ok, first one was easy, looking into the other one now
[16:22:25] what is the status of session migration, is that something done, WIP, scheduled?
[16:22:31] done
[16:22:33] aah
[16:22:38] no, WIP
[16:22:55] but the service that will get the sessions (instead of redis) is done as of today
[16:22:58] so both infras are in parallel right now?
[16:23:04] old and new?
[16:23:06] yes
[16:23:10] cool, thanks
[16:23:15] that is useful
[16:23:18] will be for some time anyway as redis has a lot more than just sessions
[16:23:28] I know, only asked about sessions
[16:23:36] infra not as in redits, but as service
[16:23:40] *redis
[16:23:51] thanks
[16:23:56] yw
[16:25:58] is there doc on how to make an icinga alert page? cf. https://phabricator.wikimedia.org/T224535#5234083
[16:27:25] XioNoX: critical => true IIRC
[16:27:46] thx!
[17:13:01] If someone is desperately looking to review some Puppet code - https://gerrit.wikimedia.org/r/c/operations/puppet/+/514332
[19:23:23] I'm also looking for someone who can help me with some python debian packaging, it's making me crazy :)
[19:26:52] XioNoX: I can't now, but happy to help at a different time
[19:30:34] thx!
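[Editor's note: earlier in the log (08:57, 11:21) the idea of a prometheus exporter for poolcounter's stats comes up. Since the daemon already exposes its stats, an exporter is mostly parsing plus string formatting. A minimal Python sketch follows; the `key: value` stats shape and the field names are hypothetical, and a real exporter would poll the daemon over TCP and serve the result over HTTP, e.g. with the prometheus_client library.]

```python
def parse_stats(raw):
    """Parse 'key: value' lines (the daemon's stats output is assumed to
    look roughly like this) into a dict of integer counters."""
    stats = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value.lstrip("-").isdigit():
            stats[key.strip().replace(" ", "_")] = int(value)
    return stats

def to_prometheus(stats, prefix="poolcounter"):
    """Render a flat dict of numeric stats as Prometheus text exposition."""
    lines = []
    for name, value in sorted(stats.items()):
        metric = f"{prefix}_{name}"
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical sample stats, just to exercise the formatter.
sample_raw = "total acquired: 120\nuptime: 3600\n"
print(to_prometheus(parse_stats(sample_raw)))
```

The output is the plain text format Prometheus scrapes, which is why wrapping an existing stats endpoint as an exporter is usually this small.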