[00:00:52] !log depool tools-worker-1017 for T141126 [00:00:53] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [00:00:53] depool is not a valid project. [00:01:00] !log tools depool tools-worker-1017 for T141126 [00:01:01] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [00:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:03:07] PROBLEM - Host tools-worker-1016 is DOWN: CRITICAL - Host Unreachable (10.68.21.6) [00:55:34] 06Labs, 10Mail: failed exim service on labs instances - https://phabricator.wikimedia.org/T135033#2513832 (10Andrew) [00:55:36] 06Labs: confirm that new base labs base image is adequate for kubernetes &c. - https://phabricator.wikimedia.org/T134944#2513830 (10Andrew) 05Open>03Resolved The image named debian-8.5-jessie now has cgroups enabled on first boot. [00:55:44] 06Labs, 10Quarry: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2513833 (10Danny_B) [01:11:21] !log deployment-prep Proper SSL certificate up at https://upload.beta.wmflabs.org - HTTP has been changed to force TLS redirect. [01:11:21] Please !log in #wikimedia-releng for beta cluster SAL [01:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [02:00:30] 10Wikibugs: Wrong comment anchors linked - https://phabricator.wikimedia.org/T141837#2514045 (10Danny_B) [02:50:01] PROBLEM - Puppet staleness on tools-grid-master is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0] [02:51:11] !log deployment-prep https://deployment.wikimedia.beta.wmflabs, https://meta.wikimedia.beta.wmflabs, and their mobile variants now also have valid certs and TLS redirects. [02:51:11] Please !log in #wikimedia-releng for beta cluster SAL [02:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [03:01:29] 06Labs, 10wikitech.wikimedia.org: Wikitech sign-up page has bad styling - https://phabricator.wikimedia.org/T136032#2514078 (10Krinkle) [03:05:07] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2514079 (10Krenair) a:03Krenair [03:06:04] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#527800 (10Krenair) This is now working for meta.wikimedia.beta.wmflabs.org and deployment.wikimedia.beta.wmflabs.org (and their... [03:37:03] git-commit[1] [05:03:50] Krenair: you got certs on beta?! awesome [09:38:28] !log tools bounce morebots production [09:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [10:46:28] valhallasw`vecto: Hi im wondering if you could take a look at https://phabricator.wikimedia.org/T141329 please? [10:46:36] Im not sure if it requires a change in the bot [10:46:41] or if gerrit has a bug. [10:46:42] ? [10:48:04] https://phabricator.wikimedia.org/diffusion/TGRT/ [10:52:47] Guest18760: Hi im wondering if you could take a look at https://phabricator.wikimedia.org/T141329 please? [10:52:59] Im not sure if the bot needs updating to support changes in gerrit [10:53:03] or gerrit has a bug [10:53:03] ? 
[10:53:05] please [10:53:05] ? [11:13:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [11:48:51] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:52:41] (03Draft1) 10Paladox: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) [11:52:45] (03Draft2) 10Paladox: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) [12:17:46] (03CR) 10Paladox: "I wonder if we should do message.patchSet.uploader.name" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: 10Paladox) [13:53:07] There is a shinken alert on deployment-elastic06 (SSH not answering). I can't get console or logs on Horizon. And I can't SSH into it. [13:53:21] * gehel is a bit lost and would appreciate help from the greater minds... [13:53:53] I was asking about that yesterday [13:54:18] gehel, is it safe to reboot? [13:55:05] Krenair: should be no issue. I can try that, but I was wondering if I could collect any info before the reboot [13:55:26] Well [13:56:02] you have ops rights in prod [13:56:11] and the server responds to ping [13:56:21] you could log into labcontrol1001.wikimedia.org and use sudo salt cmd.run [13:56:46] Krenair: thanks, I'll try that [13:57:10] `sudo salt 'deployment-elastic06.deployment-prep.eqiad.wmflabs' cmd.run $cmd` I think it'd be [13:57:45] Krenair: I have some idea about salt. Just did not know we had a saltmaster for deployment-prep... [13:57:56] Ohhhh right deployment-prep [13:57:59] Hang on [13:58:06] For most labs instances, that would be the place to use [13:58:11] For deployment-prep, we do have a specific saltmaster [13:58:40] deployment-salt02.deployment-prep.eqiad.wmflabs [13:58:42] ? [13:58:44] yes [13:59:02] krenair@deployment-salt02:~$ sudo salt 'deployment-elastic07.deployment-prep.eqiad.wmflabs' cmd.run id [13:59:02] deployment-elastic07.deployment-prep.eqiad.wmflabs: [13:59:02] uid=0(root) gid=0(root) groups=0(root) [13:59:07] same thing should work for deployment-elastic06 [13:59:34] Though it seems that it doesn't: [13:59:34] krenair@deployment-salt02:~$ sudo salt 'deployment-elastic06.deployment-prep.eqiad.wmflabs' cmd.run id [13:59:34] krenair@deployment-salt02:~$ [13:59:44] I have no other ideas [14:01:06] Krenair: Thanks for the help! I'll do the hard reboot stuff... [14:02:20] !log rebooting deployment-elastic06 (unresponsive to SSH and Salt) [14:02:21] rebooting is not a valid project. [14:02:30] !log deployment-prep rebooting deployment-elastic06 (unresponsive to SSH and Salt) [14:02:31] Please !log in #wikimedia-releng for beta cluster SAL [14:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [14:02:44] damn... I'll find the right log eventually... [14:04:09] This is the right log [14:04:11] Ignore stashbot [14:04:36] deployment-prep is not special it is a labs project and labs project SALs at at Nova_Resource:$project/SAL [14:04:46] labs-morebots handles sending entries there [14:04:47] I am a logbot running on tools-exec-1220. [14:04:47] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:04:47] To log a message, type !log . 
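A rough sketch of the pre-reboot triage that the salt suggestion above was aiming at, run from the project saltmaster named in the log; the exact commands are illustrative rather than any agreed procedure, and on a box whose I/O is wedged they will usually time out just like the 'id' call did:

    # from deployment-salt02.deployment-prep.eqiad.wmflabs (the deployment-prep saltmaster)
    sudo salt -t 10 'deployment-elastic06.deployment-prep.eqiad.wmflabs' cmd.run 'uptime; cat /proc/loadavg'
    sudo salt -t 10 'deployment-elastic06.deployment-prep.eqiad.wmflabs' cmd.run 'dmesg | tail -n 30'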
[14:07:53] oh yeah, the other problem is that the bot usually in -releng is not currently running. [14:08:00] not sure why [14:10:19] ok, so I'm confused, but it seems to be normal :) [15:04:35] (03PS3) 10Paladox: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) [15:07:16] (03PS4) 10Paladox: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) [15:19:54] chasemp: https://gerrit.wikimedia.org/r/#/c/302450/ created. I added you and thcipriani for review, but I think you gave me a third person... (I closed the hangout too fast...) [15:25:03] chasemp: disregard, paravoid_ has already merged it... [15:25:10] gehel: yep cool [15:25:50] PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:35:27] PROBLEM - Puppet run on tools-flannel-etcd-03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:36:33] (03CR) 10Zppix: [C: 031] "+1 for the idea and it doesnt blow up gerrit so i guess it works :P" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: 10Paladox) [15:37:53] PROBLEM - Puppet run on tools-elastic-03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:38:13] PROBLEM - Puppet run on tools-flannel-etcd-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:38:54] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:40:44] PROBLEM - Puppet run on tools-proxy-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:41:07] Puppet is broken in labs please see -operations [15:42:42] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:45:14] PROBLEM - Puppet run on tools-logs-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:45:20] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:45:34] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:45:50] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:47:17] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:50:57] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:51:48] o.O [15:55:04] PROBLEM - Puppet run on tools-elastic-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:58:17] o/ Grafana for labs seems to not be working. [15:58:25] It looks like I'm running into this issue: https://github.com/grafana/grafana/issues/4499 [15:59:13] It seems like everyone is saying "It's CORS" [15:59:16] Could it be CORS? [16:00:34] PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:05:12] halfak: minor emergency here but can you elaborate on what not working means specifically? [16:05:21] site down, some error, some action missing [16:09:54] what's the site down? 
[16:10:25] grafana issues but I'm unsure where or what is all [16:20:00] chasemp, sorry I was AFK. For all panes, I'm getting "Cannot read property 'message' of null" [16:20:14] This is new as of a week or two ago. So it isn't urgent [16:21:03] This only happens for the labs graphite datasource. [16:21:09] So I think that's where the issue lies. [16:22:42] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:24:36] halfak: is thsi https://grafana-labs.wikimedia.org/? [16:25:14] RECOVERY - Puppet run on tools-logs-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:14] Oh. No, I've been using the grafana datasource in grafana.wikimedia.org [16:25:20] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:27] That's what I was directed to do a couple months ago. :) [16:25:36] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:38] E.g. https://grafana.wikimedia.org/dashboard/db/ores-labs [16:25:44] yeah I thikn yuvi and godog (?) changed it out from under you [16:25:50] RECOVERY - Puppet run on tools-elastic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:51] but honestly I'm not entirely positive what the intention is atm [16:26:11] Boo. Still, that datasource should still work from grafana.wikimedia.org, right? [16:26:26] halfak: https://phabricator.wikimedia.org/T120295#2492682 [16:26:29] * halfak is not a fan of needing to change domains [16:26:49] Arg. [16:26:51] I only recall this in passing so you'll have to read up on that task w/ me on current idea [16:27:14] afaik the source hasn't been removed from production grafana [16:27:16] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:27:30] the graphite labs source that is [16:27:46] Gotcha. Thanks. So it seems that this was a change that rendered a bunch of dashboards unusable and now I need to go digging for all of the places that we linked to the grafana.wikimedia.org and change it to grafana-labs.wikimedia.org. [16:27:50] looks like the url needs changed. the js is doing a post that is getting a 302 response to the new domain [16:28:01] godog, but it seems to not be working. [16:28:25] If that could be repaired, I'd appreciate keeping the dashboard where it is. [16:28:31] But if I need to move, then I need to move. [16:28:47] bd808: is it pointed at labmon instead of graphite-labs? [16:28:52] halfak: ^ I wonder [16:28:55] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:29:08] or just graphite.wmflabs.org I guess [16:29:13] and now it's graphite-labs.wikimedia.org [16:29:26] yeah [16:29:38] there is a redir at the old domain, but that doesn't work with post [16:29:51] that makes sense [16:30:03] RECOVERY - Puppet run on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:30:12] I think we just need to fix the "wmflabs-graphite" datasource config in grafana [16:30:23] * bd808 is trying to find the right screen [16:30:59] RECOVERY - Puppet run on tools-k8s-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:31:26] godog: It seems like maybe ancillary breakage from service url change fyi [16:32:18] chasemp: yup, thanks! 
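A quick way to see the failure mode bd808 describes, assuming the old graphite.wmflabs.org name still answers /render with a 302 to the new domain: curl, like the browser XHR the datasource uses, drops the POST when it follows a redirect unless told not to.

    # the POSTed render query only gets a redirect back from the old name
    curl -si -d 'target=test&format=json' 'https://graphite.wmflabs.org/render' | head -n 3
    # following the 302 turns the request into a bodyless GET; --post302 keeps the method and body
    curl -sL --post302 -d 'target=test&format=json' 'https://graphite.wmflabs.org/render'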
[16:32:42] :(( I get permission denied from https://grafana-admin.wikimedia.org/datasources [16:33:01] not sure what ldap group has the rights to change things there [16:33:13] PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:33:43] yeah I get the same [16:34:12] huh [16:34:18] yeah [16:35:02] I wonder if there is a special non-ldap user with the "admin" role? [16:35:08] PROBLEM - Puppet run on tools-exec-1206 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:35:13] * bd808 hasn't messed with grafana much [16:35:16] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:35:17] mhh https://grafana-admin.wikimedia.org/admin/orgs/edit/1 sez a bunch of people have admin [16:35:22] PROBLEM - Puppet run on tools-exec-1202 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:35:50] RECOVERY - Puppet run on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:36:09] all of this btw is blocked on a ldap overlay to be able to shove grafana's idea of the world into our ldap [16:37:11] halfak: have you opened a ticket? We can at least document what needs changing [16:37:33] bd808, will make one now. [16:37:47] graphite.wmflabs.org -> graphite-labs.wikimedia.org in https://grafana-admin.wikimedia.org/datasources [16:38:33] how many ops,devs, and research scientists does it take to change a grafana setting :) [16:38:39] I feel a joke coming on here but I'm not creative enough [16:38:45] ha. [16:38:53] halfak: btw in the meantime since the dashboard is broken anyway, it can be moved to grafana-labs fairly easily by downloading the json from https://grafana.wikimedia.org/api/dashboards/db/ores-labs / change the datasource and import the json into grafana-labs [16:39:20] godog: ah nice I didn't realize you could bulk export like that [16:39:22] PROBLEM - Puppet run on tools-docker-test-05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:39:29] godog, yeah. Familiar with that transfer pattern. However, I'd have to change every link on the wiki to that too [16:39:42] Which would be fine if I'm only doing it once. [16:39:48] Although this is already the second move. 
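For the move godog suggests, something along these lines works from any machine with curl and jq; the datasource names are assumptions (the log only mentions "wmflabs-graphite" on the production side), so check what the export really contains before editing it.

    # pull the dashboard definition out of production grafana
    curl -s 'https://grafana.wikimedia.org/api/dashboards/db/ores-labs' > ores-labs.json
    # list the datasource names the panels reference (null means the default source)
    jq '[.dashboard.rows[].panels[].datasource] | unique' ores-labs.json
    # point the panels at whatever the labs graphite source is called on grafana-labs,
    # then import the edited "dashboard" object there via Dashboards -> Import
    sed 's/"datasource": "wmflabs-graphite"/"datasource": "graphite-labs"/' ores-labs.json > ores-labs-labs.json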
[16:39:49] :) [16:39:52] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:40:14] PROBLEM - Puppet run on tools-exec-1407 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:40:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:40:24] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:40:26] PROBLEM - Puppet run on tools-exec-1210 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:40:27] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:40:30] PROBLEM - Puppet run on tools-exec-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:40:36] RECOVERY - Puppet run on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0] [16:40:37] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:40:37] heh, yeah I don't think grafana supports redirects for dashboards, what I did previously is add a new panel with the new url [16:40:38] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:40:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1405 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:41:03] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:41:03] it would be a nice feature tho [16:41:20] 06Labs, 10Labs-Infrastructure, 10Graphite: Can't use wmflabs graphite datasource in grafana.wikimedia.org - https://phabricator.wikimedia.org/T141891#2515780 (10Halfak) [16:41:23] https://phabricator.wikimedia.org/T141891 [16:41:47] PROBLEM - Puppet run on tools-docker-test-04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:42:07] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:42:29] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:42:37] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:42:39] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:42:45] thanks halfak sorry for leading you down a wrong path in the beginning :) taht should be a solid url from on out [16:43:03] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:43:11] OK. Do you advice that we make a switch now chasemp? [16:43:43] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:44:04] As I understand it graphite-labs.wikimedia.org is here to stay and we should fight to death over it from here on out [16:44:07] if I'm you I guess I would [16:44:15] Something's wrong with a query - https://quarry.wmflabs.org/query/6012 [16:44:28] Cool will do chasemp [16:44:29] It should be dropping anything that alreay has a rationale and isn't [16:44:34] Suggestions? 
[16:44:45] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:44:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:45:13] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:45:14] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:45:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:45:26] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:45:28] chasemp, I should be using my wikitech login info at graphite-labs.wikimedia.org, right? [16:45:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:45:44] shinken-wm, I'd ask Yuvi in a few hours [16:45:47] ShakespeareFan00, ^ [16:45:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:45:52] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:45:59] halfak: yeah [16:46:09] I think it's that my query is wrong [16:46:12] Hmmm... not working. Let me try some sanity checks [16:46:18] PROBLEM - Puppet run on tools-logs-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:46:20] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:46:30] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:46:34] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:46:38] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:46:38] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:46:43] ShakespeareFan00: jsut glanced, but how does an IN where cond exclude something? 
[16:46:50] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:46:56] huh I can't update the topic [16:47:00] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:47:01] bd808: It worked previously [16:47:11] If you can design the SQL query so it works [16:47:12] PROBLEM - Puppet run on tools-worker-1016 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:47:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:47:36] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:47:40] PROBLEM - Puppet run on tools-web-static-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:47:43] chasemp: I think the channel perms got locked down due to the spammer the other day [16:47:50] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:47:51] well bunk [16:47:58] What i want to do is exclude the listed template links [16:47:59] bd808: can you update Normal to Upgrade in Progress? [16:48:02] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [16:48:08] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:48:16] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:48:19] chasemp: :( I have no rights that I know of [16:48:41] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:48:43] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:49:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:49:22] ShakespeareFan00, your query worked fine on the replicas themselevs [16:49:24] themselves [16:49:26] I checked that [16:49:32] Odd. [16:49:47] Otherwise I wouldn't've told you to wait for yuvi :) [16:50:11] PROBLEM - Puppet run on tools-webgrid-generic-1405 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:50:14] Which queires? [16:50:15] Krenair: do you have rights to change topic here (channel is now +t) [16:50:17] yeah. I have confirmed that I can't log into graphite-labs.wikimedia.org with the same creds that work at wikitech.wikimedia.org [16:50:18] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:50:26] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:50:28] The ones on the phab ticket or the NFUR finding ones mentioned in IRC today? [16:50:30] * halfak can wait until the issues are resolved. [16:50:40] PROBLEM - Puppet run on tools-exec-cyberbot is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:50:42] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:50:42] bd808, no. 
I have no special IRC rights outside of #mediawiki [16:50:53] k [16:51:00] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:51:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:51:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:51:11] Yuvi and And.rew are ops in here [16:51:32] it's cool no biggie, I'll add changing topic here to our big changes check list [16:51:37] and we'll have to dole out some perms [16:51:39] as are Coren, Ryan and Sumana [16:51:49] BOther [16:51:51] and jeremyb [16:51:51] MY fault [16:51:56] PROBLEM - Puppet run on tools-worker-1006 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:51:57] really ShakespeareFan00? [16:51:58] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:51:58] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:52:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:52:05] Forget a WHERE clause and the synatx checker didn't tell me ;) [16:52:18] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:52:22] ShakespeareFan00, your results have just appeared [16:52:22] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:52:27] in quarry [16:52:28] Yeah [16:52:30] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:52:31] maybe it took a while to load? 
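On chasemp's question above about how an IN condition excludes anything: it doesn't on its own; the usual shape for "pages that do not carry any of the listed templates" is an anti-join (LEFT JOIN ... IS NULL). A sketch against the enwiki replica, with a made-up template name standing in for whatever the NFUR queries actually target:

    mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb enwiki_p -e "
      SELECT p.page_title
        FROM page p
        LEFT JOIN templatelinks tl
          ON tl.tl_from = p.page_id
         AND tl.tl_namespace = 10
         AND tl.tl_title IN ('Non-free_use_rationale')   -- placeholder template name(s)
       WHERE p.page_namespace = 6                        -- File: pages
         AND tl.tl_from IS NULL                          -- i.e. none of the listed templates present
       LIMIT 100;"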
[16:52:34] Possibly [16:52:39] PROBLEM - Puppet run on tools-worker-1021 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:53:05] THE NFUR Finding queries were badly written [16:53:15] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:53:24] I'd left out a WHERE meaning ti was querying something other than what I thought it was ;) [16:53:39] Always a logical error to check for :( [16:53:41] (sigh) [16:53:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:54:03] PROBLEM - Puppet run on tools-docker-test-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:54:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:54:27] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:54:47] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:54:56] Krenair : Thanks [16:55:07] ShakespeareFan00, I didn't do anything, you're welcome :) [16:55:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [16:55:17] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:55:18] Thanks for being patient [16:55:19] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:55:21] heh [16:55:31] And for looking into the phab ticket I raised :) [16:55:31] PROBLEM - Puppet run on tools-redis-1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:55:39] PROBLEM - Puppet run on tools-web-static-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:55:57] PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:55:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:56:03] PROBLEM - Puppet run on tools-elastic-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:56:10] https://quarry.wmflabs.org/query/6052 - Still seeing phantoms here... 
(but already reported these) [16:56:38] PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:56:52] PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:56:54] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:57:03] PROBLEM - Puppet run on tools-checker-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:57:25] PROBLEM - Puppet run on tools-worker-1022 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:57:41] PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:57:51] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:57:55] PROBLEM - Puppet run on tools-exec-1204 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:57:59] PROBLEM - Puppet run on tools-exec-1219 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:58:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [16:58:47] bd808, okay so it turns out I have a lot more IRC permissions than I thought I did: op in mediawiki, mediawiki-core, mediawiki-feed, mediawiki-visualeditor, wikimedia-editing, wikimedia-staff, and wikimedia-codereview - how do I accumulate all these things? [16:58:51] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:59:15] PROBLEM - Puppet run on tools-worker-1019 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:59:33] nothing in here, -tech, -operations, -dev or anywhere more commonly useful though [17:00:14] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:00:16] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [17:00:28] PROBLEM - Puppet run on tools-exec-1211 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:00:30] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:01:05] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:01:05] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:01:32] should the /topic be changed until after the upgrade? :) [17:01:37] PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:01:53] Krenair: Something that doesn't yet exist on Quarry is the ability to let other people 'update' your query but not change it [17:02:03] (I.e a "Refresh" button...) [17:02:07] ShakespeareFan00, refresh results? 
[17:02:20] Depending on the query size , yes [17:02:23] could open a feature suggestion ticket [17:02:26] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:02:34] Krenair: Will consider that [17:02:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:02:46] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:02:49] Most of the queries I've written are replacemennts for ones I used to use at WP:DBR [17:03:03] or CATSCAN [17:04:11] Also another feature I might suggest is to limit query to a subset of the database, like pages that are known to have been edited in the last 24hours or so [17:04:44] I'm not sure if the replication can do 'journaling' like that though [17:04:49] incoming RECOVERY storm I think [17:05:18] ShakespeareFan00, um, that sounds like you want it writing SQL for you? [17:05:36] Quarry takes SQL and executes it on the replica server for you... [17:05:38] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:05:40] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:06:04] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:06:19] Krenair: I'm not sure it's possible to do a query based on timestamps directly from the page table? [17:06:32] If it is, then it just another WHERE clause [17:07:02] ShakespeareFan00, you want the page_touched field [17:07:07] Thanks [17:07:07] I think [17:07:27] or I suppose you could join revision [17:07:41] That too [17:08:13] Searching revisions though EATS server time :) [17:11:16] revision_userindex? probably not useful for this [17:12:23] Guest18760: want to identify, yuvi? :D [17:12:25] Guest18760: It was about limitiing a query to a specifc time range [17:12:53] RECOVERY - Puppet run on tools-elastic-03 is OK: OK: Less than 1.00% above the threshold [0.0] [17:13:03] This could be done straight in the SQL using a suitable clause... or alterntively there could be subsstes of the db that could used... [17:13:45] Running through the entire db of a project to find 5 or 6 edits in the last 24 hours eeemed to be overkill [17:13:51] *seemed [17:14:01] oh there you are [17:14:16] ShakespeareFan00, what do you mean by subsets of the DB? 
[17:14:45] Krenair: A Smaller set of data representing activity over say the last 24 hours or so [17:14:51] because it sounds to me like you want to reimplement half of SQL and/or mysql [17:14:59] Ah OK [17:15:10] I obviously don't understand how SQL works :( [17:15:10] RECOVERY - Puppet run on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:14] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:15] RECOVERY - Puppet run on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:22] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:25] Quarry takes your SQL query and gives it to MySQL [17:15:26] RECOVERY - Puppet run on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:27] RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:30] RECOVERY - Puppet run on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:36] and MY SQL optomises the query? [17:15:41] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:43] RECOVERY - Puppet run on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:43] (using indexing or something)? [17:15:46] yes [17:15:49] RECOVERY - Puppet run on tools-webgrid-lighttpd-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [17:16:01] ah :D [17:16:02] OKay [17:16:04] Sorry [17:16:06] that's better [17:16:19] yuvipanda: Do you read phab-tickets? [17:16:30] halfak godog I didn't know grafana used POSTs for dashboards. [17:16:39] I just woke up, so not yet [17:17:03] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [17:17:17] I setup redirects to not break it [17:17:31] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:17:37] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0] [17:17:39] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:17:48] godog is there a way to have grafana use GETs? [17:17:49] Krenair: Sorry [17:18:01] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:18:07] (I must sound clueless at times.) [17:18:15] RECOVERY - Puppet run on tools-flannel-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:18:25] for what? 
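To make the time-range idea above concrete: page_touched and rev_timestamp are both plain MediaWiki timestamps (YYYYMMDDHHMMSS), so the "last 24 hours" subset really is just another WHERE clause, and EXPLAIN shows whether MySQL can satisfy it from an index. A sketch against the enwiki replica; note page_touched also moves on purges and template edits, so the revision join is the stricter test of "actually edited":

    mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb enwiki_p -e "
      SET @cutoff = DATE_FORMAT(NOW() - INTERVAL 1 DAY, '%Y%m%d%H%i%s');
      -- pages touched in the last day (includes purges and template edits)
      SELECT page_namespace, page_title FROM page
       WHERE page_touched > @cutoff LIMIT 50;
      -- pages actually edited in the last day
      SELECT DISTINCT page_namespace, page_title
        FROM revision JOIN page ON rev_page = page_id
       WHERE rev_timestamp > @cutoff LIMIT 50;
      -- 'type: ALL' in the output would mean a full scan, which is what eats server time
      EXPLAIN SELECT page_id FROM page
       WHERE page_namespace = 6 AND page_title = 'Example.jpg';"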
[17:18:32] ok, possible labs network interruption coming up [17:18:53] Krenair: I'm sorry for sounding clueless [17:19:11] I don't mind [17:19:15] and coming up with ideas that anyone that actually understood stuff wouldn't ask about [17:19:23] RECOVERY - Puppet run on tools-docker-test-05 is OK: OK: Less than 1.00% above the threshold [0.0] [17:19:24] yuvipanda: I bet the POST is built in due to worries about url length [17:19:45] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [17:19:49] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:19:51] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:19:57] yuvipanda: no idea :( [17:20:14] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [17:20:24] yuvipanda: I filed a phab-ticket about this - https://quarry.wmflabs.org/query/6052 [17:20:27] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0] [17:20:31] I'm seeing "phantoms" [17:20:35] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:03] RECOVERY - Puppet run on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:11] (03CR) 10Yuvipanda: [C: 032] Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: 10Paladox) [17:21:21] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:29] RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:33] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:38] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:38] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:48] RECOVERY - Puppet run on tools-docker-test-04 is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:00] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:10] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:34] I'm going to kill shinken-wm [17:22:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:50] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:06] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:40] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:41] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:42] yuvipanda, could just mute it in here [17:33:02] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0] [17:33:16] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [17:33:48] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [17:34:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [17:34:22] RECOVERY - Puppet run on 
tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [17:34:47] RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0] [17:35:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [17:35:17] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [17:36:04] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [17:36:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [17:39:56] ShakespeareFan00 which ticket? I assume this is labsdb related, not entirely sure what I can do to help :( [17:40:13] https://phabricator.wikimedia.org/T141818 [17:41:28] 06Labs, 10Quarry, 10DBA: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2516050 (10yuvipanda) [17:41:35] I've added the DBA tag, I guess the DBA will take a look at it when they can. [17:46:31] 06Labs, 10Quarry, 10DBA: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2513270 (10jcrespo) Please follow the recommendations mentioned at: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database/Replica_drift [17:47:54] 06Labs, 10Quarry, 10DBA: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2516093 (10jcrespo) [17:47:58] 06Labs, 10DBA: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2516091 (10jcrespo) [17:50:11] 06Labs, 10Quarry, 10DBA: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2516102 (10ShakespeareFan00) [17:50:15] 06Labs, 10DBA: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2516103 (10ShakespeareFan00) [17:52:02] 06Labs, 10DBA: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2415416 (10ShakespeareFan00) Notifying here concerning https://phabricator.wikimedia.org/T141818 [17:56:38] 06Labs, 10DBA: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2516123 (10jcrespo) May I ask you to please copy the query here to check the drift? We keep here the older subtasks only because they are older than this one, but following a single format will help speed up the resolution o... [18:32:31] milimetric meeting? [18:33:21] is the labs upgrade done done? [18:38:55] milimetric lost video, refreshing [19:06:02] PROBLEM - SSH on tools-merlbot-proxy is CRITICAL: Server answer [19:06:35] (03Merged) 10jenkins-bot: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: 10Paladox) [19:08:57] 06Labs, 10Tool-Labs: /home/ilya missing replica.my.cnf - https://phabricator.wikimedia.org/T140592#2516410 (10intracer) [19:13:17] I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.8.0.0 [libirc v. 
1.0.3] my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features [19:13:17] @help [19:14:07] We need better spam prevention rather than +t [19:14:32] +t just inconveniences everyone who isn't an op from setting the topic [19:16:22] yuvipanda: poky poke [19:17:21] matanya just ask :) [19:17:29] pm you [19:17:30] (I'm going afk shortly, but others might be able to help you) [19:18:17] yuvipanda could you deploy https://gerrit.wikimedia.org/r/302416 please [19:18:19] it merged [19:18:20] ? [19:22:05] 06Labs, 10Tool-Labs: Write diamond collector for gridengine job count stats - https://phabricator.wikimedia.org/T140999#2516497 (10yuvipanda) This doesn't seem to be run by diamond properly, I'll take a look in a bit. [19:23:05] paladox I have to go somehwere to get my laptop charger back just now, I'll deploy in a few hours! [19:23:19] Ok [19:23:21] thanks [19:24:17] andrewbogott: planning on a rights review for wikitech at some point ? [19:27:07] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2516513 (10chasemp) [19:30:17] !log git adding Alex Monk (Krenair) to project [19:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL, Master [19:31:02] RECOVERY - SSH on tools-merlbot-proxy is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [19:35:19] PROBLEM - Puppet staleness on tools-merlbot-proxy is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [43200.0] [19:40:21] RECOVERY - Puppet staleness on tools-merlbot-proxy is OK: OK: Less than 1.00% above the threshold [3600.0] [19:49:06] matanya: do you have something specific in mind? [19:50:32] andrewbogott: https://wikitech.wikimedia.org/wiki/Special:ListUsers/sysop [19:52:24] Eloquence, LVilla (WMF), Sumanah and WikiSysop? [19:52:38] and Coren? [19:59:37] yes, that [20:31:28] what is 'WikiSysop'? [20:32:40] andrewbogott: https://wikitech.wikimedia.org/wiki/OpenStack#Create_a_wiki_user_for_yourself.2C_and_promote_it_using_WikiSysop? a user name, I think [20:33:44] yeah, so it's a special user for bootstrapping… not obvious to me that it shouldn't have admin rights [20:34:46] I think it's historical [20:35:56] valhallasw`cloud Hi im wondering if this will fix the grrrit-bot https://gerrit.wikimedia.org/r/#/c/302416/ [20:35:59] from before my time [20:36:10] Krenair: mine too apparently [20:36:14] paladox: yuvipanda already merged that? [20:36:31] Yeh [20:36:46] But hasent been deployed [20:36:47] yet [20:36:51] * Krenair is digging through core history [20:36:55] im just wondering did i do the correct fix. [20:37:09] valhallasw`cloud ^^ [20:37:10] paladox: I have no clue. [20:37:13] Ok [20:37:45] valhallasw`cloud im wondering could you deploy it please [20:37:45] ? [20:37:56] No. [20:38:03] ok [20:38:35] deploying and possibly reverting grrrit-wm is not something I'm comfortable with. [20:42:40] https://www.mediawiki.org/wiki/Special:Code/MediaWiki/69518 [20:43:01] it was also used in some old parser tests and things [20:43:23] probably comes from the original wikitech install [20:44:09] Oh [20:44:12] ok [20:45:52] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [21:20:52] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:32:24] I'm having issues with routing for ORES in labs. 
[21:32:29] I think [21:32:33] https://ores.wmflabs.org/ times out [21:32:47] But if I log into our web nodes and "wget localhost:8080" everything is working fine. [21:33:00] I see that tools.wmflabs.org is up [21:33:04] SO it's not fully proxy [21:34:03] tools.wmflabs.org isn't using the general proxy mechanism [21:34:04] Check quarry? [21:34:07] I'll be at a laptop in a few mins [21:34:18] Ahh yeah. Same story for quarry [21:34:24] chasemp, ^ [21:34:44] Looks like the DNS proxy might have fallen over [21:35:22] hm andrewbogott about^? [21:35:41] novaproxy-01 [21:36:22] yuvipanda: yeah it seems novaproxy is out atm [21:36:32] Is the instance up? [21:36:45] (am 3min from office now) [21:37:20] Looking [21:37:30] I tihnk the instance may be down [21:38:01] I can ping the instance [21:38:05] Console log? [21:38:07] I can't ssh anyway [21:38:12] I wonder if it got hit by the freezes [21:38:35] yep [21:38:36] 'the instance' is the proxy instance or just halfak's instance? [21:38:36] Salt? [21:38:38] seems like similar issue [21:38:40] proxy andrewbogott [21:38:48] andrewbogott: seems novaproxy-01 wigged out [21:38:52] I'm going to reboot [21:38:59] chasemp, check salt first [21:39:02] we have a seemingly serious io issue [21:39:05] check salt for what? [21:39:08] chasemp: ok, that's what I would do [21:39:14] to see if you can get in that way and find the issue [21:39:37] salt is a pretty useless diagnostic tool here, I'm going to reboot to restore the outage and then look at it [21:39:42] but best thing we can do is get the console going [21:39:57] fine, but salt will let you run commands on the instance [21:40:03] unless it's really dead [21:40:13] sure but what commands do you want to see? [21:40:33] I'm not being obtuse here I have no clue what salt things you are thinking [21:40:44] and atm the outage outways any I can think of [21:40:50] just reboot it [21:40:52] ok am on [21:41:07] it's rebooting [21:41:09] ok [21:41:27] krenair 100% of the time with instance freezes salt has been useless, and I've had a max of maybe 3-4 times ever salt has been useful. [21:42:06] is the instance frozen if you can ping it? [21:42:19] yes because it doesn't do anything else useful. [21:42:35] labvirt1001.eqiad.wmnet too [21:42:52] right [21:43:00] well, frozen is a pretty useless and broad idea [21:43:06] nodes are failing at various levels of io issues (possibly) [21:43:13] but [21:43:16] [11409409.151647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21:43:16] [11409409.152879] INFO: task cron:14373 blocked for more than 120 seconds. [21:43:17] chasemp did you see what was on teh console log before you rebooted? [21:43:21] yes [21:43:23] right [21:43:46] https://tools.wmflabs.org/nagf/?project=project-proxy you can also see it stops sending stuff to graphite [21:43:57] [11409409.150117] INFO: task cron:14372 blocked for more than 120 seconds. [21:43:57] [11409409.151003] Not tainted 3.16.0-4-amd64 #1 [21:43:58] [11409409.151647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21:43:58] [11409409.152879] INFO: task cron:14373 blocked for more than 120 seconds. [21:43:58] [11409409.153825] Not tainted 3.16.0-4-amd64 #1 [21:43:58] [11409409.155729] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21:43:59] [11409409.156902] INFO: task cron:14374 blocked for more than 120 seconds. 
[21:43:59] [11409409.157905] Not tainted 3.16.0-4-amd64 #1 [21:43:59] [11409409.159943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21:44:00] a cron job locked up [21:44:00] conceivably on io [21:44:00] let me look at io [21:44:14] that's cron itself no, rather than a cronjob [21:44:42] I see no io spikes [21:44:53] https://graphite-labs.wikimedia.org/render/?width=588&height=310&_salt=1470174276.01&target=project-proxy.novaproxy-01.iostat.vda.io [21:45:25] halfak it should be back now [21:45:33] nothing that means anything to me [21:45:33] Services are recovering. [21:45:36] yeah [21:45:36] Thanks folks [21:45:45] np thanks for bringing it to our attention, halfak [21:45:48] :D [21:46:03] so thoroughly confusing, but we seem to have a resource contention issue(s) [21:47:14] possibly. [21:47:24] * yuvipanda looks at ganglia for labvirt servers [21:48:27] !log tools.stashbot Updated to 21ca5f1 [21:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master [21:49:09] can't really see anything useful there [21:49:30] yeah, it's an artificial contention I think [21:50:20] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2516967 (10yuvipanda) [21:50:26] ok, I updated ^ [21:52:31] 21:42:50 I see procs go into d-wait [21:52:53] 532 root 1 0.00s 0.00s 25804K 2916K 1528K 0K N- - D 0 2% nginx [21:52:53] 533 root 1 0.00s 0.00s 20172K 3396K 1376K 0K N- - D 1 2% salt-minion [21:52:54] 179 root 1 0.03s 0.00s 34376K 3680K 1196K 0K N- - S 0 1% systemd-udevd [21:52:55] 181 root 1 0.00s 0.00s 34376K 3644K 1124K 0K N- - S 0 1% systemd-udevd [21:52:57] 595 root 1 0.00s 0.01s 16568K 4288K 904K 0K N- - R 1 1% atop [21:52:59] 159 root 1 0.04s 0.07s 32968K 3180K 788K 0K N- - S 1 1% systemd-journa [21:53:01] 178 root 1 0.00s 0.00s 34744K 4068K 780K 0K N- - S 1 1% systemd-udevd [21:53:03] 180 root 1 0.00s 0.00s 34376K 3448K 780K 0K N- - S 1 1% systemd-udevd [21:53:05] 598 root 1 0.00s 0.00s 21032K 3012K 780K 0K N- - D 1 1% ntpd [21:53:07] 534 diamond 1 0.00s 0.00s 20172K 3420K 756K 0K N- - D 0 1% diamond [21:53:09] 549 root 1 0.00s 0.00s 39744K 2076K 744K 0K N- - D 0 1% dbus-daemon [21:53:37] chasemp pastebin :P [21:53:39] it's all garbled... [21:53:57] then again it jumps from 21:20 to 21:42 [21:54:07] missed teh 21:30 collection [21:55:24] chasemp is that atop output? [21:55:28] it is [21:55:39] also can you put that in a pastebin so I can read it? 
[21:55:48] pasting things onto IRC directly almost always garbles them and loses tabs / spaces / etc [21:56:12] var/log/atop# atop -r atop_20160802 -b 20:00 [21:56:15] sure [21:56:17] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2517006 (10yuvipanda) [21:56:19] ah, thanks [21:56:40] "t" to jump to next collection [21:57:18] cron indeed spiked in io tho [21:58:37] not much of a spike but still either coincidentally it was caught or the surge itself triggered [21:59:55] yeah seems like not a spike just a recognition of it's blocking state maybe [22:00:27] I also see [22:00:34] super high ksoftirqd usage [22:00:38] which I was seeing in a lot of places [22:00:47] > 3 root 20 0 0 0 0 R 5.3 0.0 0:26.62 ksoftirqd/0 [22:00:58] it's hard to know what to make of that, cause or effect [22:01:03] 4-5% constant usage [22:01:12] well, upgrading kernel got rid of it on most of the worker nodes [22:01:22] and in talking to moritzm the next step for us was to install irqbalance if it came back [22:01:47] novaproxy-01 is on a old kernel, so I'm tempted to upgrade it, but that's *maybe* unrelated to what happened to it, but I'm not entirely sure [22:01:58] has this happened on any nodes where we upgrade the kernel and we see ksoftirqd usage drop? [22:02:08] I thought it had [22:02:22] it's jessie also [22:02:27] chasemp can you rephrase? I don't understand what that sentence meant [22:02:28] which is even more interesting then [22:02:33] what does 'this' refer to? [22:03:00] lockup, asking if we have seen the lockup issue post-kernel update where ksoftirqd usage drops [22:03:19] yup [22:03:27] so I think they're unrelated [22:03:31] yeah [22:03:34] so probably shouldn't do it now, confounding effects. [22:03:57] one thing k8s has in common with these nodes and uncommon w/ the rest of tools is [22:03:58] jessie [22:04:10] the thread seems to carry through and I don't knwo what to make of it yet [22:04:34] so deployment-logstash2 I wonder, if it was jessie or trusty [22:04:39] since it was stuck with very similar symptoms [22:04:40] looking [22:04:56] it is jessie [22:05:32] yeah [22:06:09] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2517136 (10yuvipanda) [22:06:11] * yuvipanda adds 'OS' to the table [22:06:11] So maybe the common thing isn't k8s (as we puzzled over) but instead Jessie [22:07:14] so we need console, and to catch this in teh act, and to loop in one of the debian guys [22:07:17] atm [22:07:46] unless we can come up with a non-jessie outlier and not that it's scientific but sure seems consistent [22:09:14] yah [22:10:10] chasemp so next step now is to just get console access working? [22:10:19] I wish google had a "show me results from last year" [22:10:33] chasemp can you write out your thinking on https://phabricator.wikimedia.org/T141673? I'll figure out a way to make sure we have appropriate monitoring for the novaproxy [22:12:19] yuvipanda: I will yep, kind of in an awkward place atm and need to close up but I'm making my notes and will update first thing am or when i can come back [22:12:27] also doing a bit of digging before [22:12:33] chasemp cool. [22:12:43] I'll dig into monitoring [22:14:10] holy crap there is a lot of logging to qemu [22:15:18] oh? [22:15:42] nothing useful it seems [22:17:30] yuvipanda, andrewbogott, would it be OK to have Sigyn in here? It's a bot from freenode that k lines spammers. (Need to get approval from chan ops to have it in here) [22:18:06] paladox I deployed your change, test? 
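For reference, the diagnostics being discussed above written out as commands; the console step assumes access to the OpenStack tooling on the control host, the rest runs on the instance once it is reachable again, and none of this is an agreed procedure, just a sketch of the same steps.

    # guest console output when ssh/salt are wedged (instance name from this incident)
    openstack console log show novaproxy-01 | tail -n 50
    # replay the atop history from around the stall, as chasemp did
    atop -r /var/log/atop/atop_20160802 -b 21:40
    # list anything currently stuck in uninterruptible sleep (state D)
    ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'
    # with root, the kernel stack shows what a given D-state task is blocked on
    sudo cat /proc/<pid>/stack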
[22:18:12] ok thanks [22:18:22] tom29739 assuming that's all it does, sure! who runs it? [22:18:38] yuvipanda could you restart the grrrit-wm bot please [22:18:40] yuvipanda: freenode [22:18:43] since gerrit was restarted [22:18:45] but not the bot [22:18:50] please [22:18:51] > [22:18:52] ? [22:19:00] paladox I just restarted the bot [22:19:05] Oh [22:19:07] ok [22:19:09] thanks [22:19:28] It seems the bot left [22:19:29] ? [22:19:32] wait gah [22:19:35] I just restarted it too [22:19:37] sorry [22:19:38] Oh [22:20:11] !log tools.lolrrit-wm re-re-started grrrit-wm [22:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [22:20:17] No it dosent seemed to have worked [22:20:27] paladox: it came back [22:20:33] Yep [22:21:50] thanks legoktm :) [22:22:46] yuvipanda it seems the bot is not working [22:22:47] any more [22:23:10] i tested with comment and uploading patches, ive only touched uploading patches part not comments [22:25:20] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2517277 (10Andrew) [22:28:59] (03Draft2) 10Paladox: Another fix so that it will tell you the correct user [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302617 [22:29:02] (03Draft1) 10Paladox: Another fix so that it will tell you the correct user [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302617 [22:31:26] paladox what's your wikitech username? [22:31:34] paladox [22:31:46] ah you aren't a memeber of the tools project yet [22:32:14] nope [22:32:38] !log tools added paladox to tools [22:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:32:45] thanks [22:32:52] !log tools.lolrrit-wm added paladox to tool [22:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [22:32:59] thanks [22:33:38] paladox https://wikitech.wikimedia.org/wiki/Grrrit-wm has info [22:33:48] Yep thankyou [22:34:00] paladox basically, the code is in ~/lolrrit-wm - so you can cherry pick your patch there, and restart the service based on instructions there and see how it goes [22:34:11] oh thanks [22:34:17] how do i cherry pick [22:34:24] do i just do the gerrit git checkout [22:34:43] For example like this [22:34:44] git fetch ssh://paladox@gerrit.wikimedia.org:29418/operations/puppet refs/changes/01/302601/3 && git checkout FETCH_HEAD [22:34:52] It would be exactly that command [22:34:54] but similar [22:34:56] ? [22:35:48] paladox yes, except 'git cherry-pick' instead of 'git checkout' in the end [22:35:57] Ok [22:36:03] How do i ssh into it please [22:36:04] ? [22:37:14] paladox: to get to tools it's "ssh tools-bastion-03" from the main labs bastion [22:37:22] Oh [22:37:34] Then "become " [22:37:38] Something like [22:37:38] Host gerrit-test3 [22:37:39] ProxyCommand ssh -a -W %h:%p -A paladox@bastion3.wmflabs.org [22:37:39] User paladox [22:37:42] oh [22:38:06] Or just direct ssh to login.tools.wmflabs.org [22:38:14] Ok [22:39:24] * tom29739 usually treats tools and the rest of labs as 2 different worlds and has separate shell windows [22:39:31] lol [22:40:06] tom29739 how i get into tools.lolrrit-wm [22:40:08] please [22:40:31] Never mind [22:40:33] managed it [22:40:39] become lolrrit-wm [22:40:51] That's right [22:41:04] Ok im applying it now [22:41:08] now how do i log it [22:41:10] please [22:41:11] ? 
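Putting the scattered instructions above together, the cherry-pick workflow for the tool looks roughly like the following. The refs/changes path and patchset number for change 302617 are assumptions here (Gerrit shows the exact ref on the change's download links), and ~/lolrrit-wm is the checkout named on the Grrrit-wm wiki page:

    # from the main labs bastion, or directly:
    ssh login.tools.wmflabs.org
    become lolrrit-wm
    cd ~/lolrrit-wm
    # fetch the change from Gerrit and cherry-pick it onto the tool's checkout
    git fetch ssh://paladox@gerrit.wikimedia.org:29418/labs/tools/grrrit \
        refs/changes/17/302617/2 && git cherry-pick FETCH_HEAD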
[22:41:18] !log Depooling tools-worker 1012 and 1013 for T141126 [22:41:19] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [22:41:19] Depooling is not a valid project. [22:41:20] paladox: !log [22:41:30] Oh i mean for tools [22:41:30] !log tools Depooling tools-worker 1012 and 1013 for T141126 [22:41:31] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [22:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:41:41] So !log tools. message [22:41:49] paladox: ^ [22:42:04] That's the normal way for a tool [22:42:08] !log tools cherry picking 302617 onto lolrrit-wm [22:42:11] Ok thanks [22:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:42:31] * tom29739 face palms [22:42:42] paladox: that's gone into the tools log [22:42:47] Oh [22:42:49] woops [22:42:51] sorry [22:42:59] !log tools.lolrrit-wm cherry picking 302617 onto lolrrit-wm [22:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [22:44:00] tom29739 how do i restart it [22:44:05] https://wikitech.wikimedia.org/wiki/Grrrit-wm [22:44:12] PROBLEM - Host tools-worker-1012 is DOWN: CRITICAL - Host Unreachable (10.68.16.49) [22:44:13] Do i kubectl delete pod [22:44:13] ? [22:44:43] !log tools depool tools-worker-1015 for T141126 [22:44:44] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [22:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:45:15] !log tools.lolrrit-wm restarting grrrit-wm bot. [22:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [22:45:20] PROBLEM - Host tools-worker-1013 is DOWN: CRITICAL - Host Unreachable (10.68.18.118) [22:46:19] yay [22:46:21] it joined [22:46:24] lets test [22:48:32] PROBLEM - Host tools-worker-1015 is DOWN: CRITICAL - Host Unreachable (10.68.23.37) [22:48:54] yuvipanda the bot keeps restarting [22:48:55] ? [22:49:01] paladox look at logs [22:49:07] probably means there's a bug in your code [22:49:14] Ok [22:49:32] !log tools depooled tools-worker-1014 as well for T141126 [22:49:34] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [22:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:51:02] yuvipanda i carnt find anything in the logs causing this [22:51:30] Im going to revert [22:51:36] and try again tomarror [22:52:15] ok! [22:52:33] yuvipanda do you mind me manually hacking it to revert my code [22:52:45] since it is getting late and will try again tomarror [22:52:53] PROBLEM - Host tools-worker-1014 is DOWN: CRITICAL - Host Unreachable (10.68.23.152) [22:54:53] (03Abandoned) 10Paladox: Another fix so that it will tell you the correct user [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302617 (owner: 10Paladox) [22:56:08] paladox nope, feel free. [22:56:17] Thanks [22:57:12] !log tools.lolrrit-wm tempoaraily reverting two of my patches, will try and do more testing tomarror. [22:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [23:06:04] yuvipanda: Maybe you can ping bd808 for the meeting if he's with you? 
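The restart-and-revert dance at the end of this exchange goes through Kubernetes, since grrrit-wm runs as a pod under the lolrrit-wm tool. A sketch of the steps, with the pod name left as a placeholder because it changes on every restart (the Grrrit-wm wiki page has the authoritative instructions):

    # as the tool user (become lolrrit-wm):
    kubectl get pods                        # find the current grrrit-wm pod name
    kubectl logs <grrrit-wm-pod>            # see why the bot keeps restarting
    kubectl delete pod <grrrit-wm-pod>      # the controller respawns it with the current code
    # backing the cherry-picked commit out again before the next restart
    git -C ~/lolrrit-wm revert --no-edit HEAD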
[23:08:58] 06Labs, 10Labs-Infrastructure, 06Operations: python-designateclient package version does not match between labtestweb2001 and silver - https://phabricator.wikimedia.org/T134543#2268964 (10AlexMonk-WMF) Update: @Andrew did this during the upgrade of Labs from Kilo to Liberty. [23:11:51] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:51:51] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]