[00:00:52] !log depool tools-worker-1017 for T141126 [00:00:53] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [00:00:53] depool is not a valid project. [00:01:00] !log tools depool tools-worker-1017 for T141126 [00:01:01] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [00:01:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [00:03:07] PROBLEM - Host tools-worker-1016 is DOWN: CRITICAL - Host Unreachable (10.68.21.6) [00:55:34] 06Labs, 10Mail: failed exim service on labs instances - https://phabricator.wikimedia.org/T135033#2513832 (10Andrew) [00:55:36] 06Labs: confirm that new base labs base image is adequate for kubernetes &c. - https://phabricator.wikimedia.org/T134944#2513830 (10Andrew) 05Open>03Resolved The image named debian-8.5-jessie now has cgroups enabled on first boot. [00:55:44] 06Labs, 10Quarry: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2513833 (10Danny_B) [01:11:21] !log deployment-prep Proper SSL certificate up at https://upload.beta.wmflabs.org - HTTP has been changed to force TLS redirect. [01:11:21] Please !log in #wikimedia-releng for beta cluster SAL [01:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [02:00:30] 10Wikibugs: Wrong comment anchors linked - https://phabricator.wikimedia.org/T141837#2514045 (10Danny_B) [02:50:01] PROBLEM - Puppet staleness on tools-grid-master is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [43200.0] [02:51:11] !log deployment-prep https://deployment.wikimedia.beta.wmflabs, https://meta.wikimedia.beta.wmflabs, and their mobile variants now also have valid certs and TLS redirects. [02:51:11] Please !log in #wikimedia-releng for beta cluster SAL [02:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [03:01:29] 06Labs, 10wikitech.wikimedia.org: Wikitech sign-up page has bad styling - https://phabricator.wikimedia.org/T136032#2514078 (10Krinkle) [03:05:07] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#2514079 (10Krenair) a:03Krenair [03:06:04] 06Labs, 10Labs-Infrastructure, 10Beta-Cluster-Infrastructure, 06Operations: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#527800 (10Krenair) This is now working for meta.wikimedia.beta.wmflabs.org and deployment.wikimedia.beta.wmflabs.org (and their... [03:37:03] git-commit[1] [05:03:50] Krenair: you got certs on beta?! awesome [09:38:28] !log tools bounce morebots production [09:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [10:46:28] valhallasw`vecto: Hi im wondering if you could take a look at https://phabricator.wikimedia.org/T141329 please? [10:46:36] Im not sure if it requires a change in the bot [10:46:41] or if gerrit has a bug. [10:46:42] ? [10:48:04] https://phabricator.wikimedia.org/diffusion/TGRT/ [10:52:47] Guest18760: Hi im wondering if you could take a look at https://phabricator.wikimedia.org/T141329 please? [10:52:59] Im not sure if the bot needs updating to support changes in gerrit [10:53:03] or gerrit has a bug [10:53:03] ? 
[10:53:05] please [10:53:05] ? [11:13:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [11:48:51] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [11:52:41] (03Draft1) 10Paladox: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) [11:52:45] (03Draft2) 10Paladox: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) [12:17:46] (03CR) 10Paladox: "I wonder if we should do message.patchSet.uploader.name" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: 10Paladox) [13:53:07] There is a shinken alert on deployment-elastic06 (SSH not answering). I can't get console or logs on Horizon. And I can't SSH into it. [13:53:21] * gehel is a bit lost and would appreciate help from the greater minds... [13:53:53] I was asking about that yesterday [13:54:18] gehel, is it safe to reboot? [13:55:05] Krenair: should be no issue. I can try that, but I was wondering if I could collect any info before the reboot [13:55:26] Well [13:56:02] you have ops rights in prod [13:56:11] and the server responds to ping [13:56:21] you could log into labcontrol1001.wikimedia.org and use sudo salt cmd.run [13:56:46] Krenair: thanks, I'll try that [13:57:10] `sudo salt 'deployment-elastic06.deployment-prep.eqiad.wmflabs' cmd.run $cmd` I think it'd be [13:57:45] Krenair: I have some idea about salt. Just did not know we had a saltmaster for deployment-prep... [13:57:56] Ohhhh right deployment-prep [13:57:59] Hang on [13:58:06] For most labs instances, that would be the place to use [13:58:11] For deployment-prep, we do have a specific saltmaster [13:58:40] deployment-salt02.deployment-prep.eqiad.wmflabs [13:58:42] ? [13:58:44] yes [13:59:02] krenair@deployment-salt02:~$ sudo salt 'deployment-elastic07.deployment-prep.eqiad.wmflabs' cmd.run id [13:59:02] deployment-elastic07.deployment-prep.eqiad.wmflabs: [13:59:02] uid=0(root) gid=0(root) groups=0(root) [13:59:07] same thing should work for deployment-elastic06 [13:59:34] Though it seems that it doesn't: [13:59:34] krenair@deployment-salt02:~$ sudo salt 'deployment-elastic06.deployment-prep.eqiad.wmflabs' cmd.run id [13:59:34] krenair@deployment-salt02:~$ [13:59:44] I have no other ideas [14:01:06] Krenair: Thanks for the help! I'll do the hard reboot stuff... [14:02:20] !log rebooting deployment-elastic06 (unresponsive to SSH and Salt) [14:02:21] rebooting is not a valid project. [14:02:30] !log deployment-prep rebooting deployment-elastic06 (unresponsive to SSH and Salt) [14:02:31] Please !log in #wikimedia-releng for beta cluster SAL [14:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [14:02:44] damn... I'll find the right log eventually... [14:04:09] This is the right log [14:04:11] Ignore stashbot [14:04:36] deployment-prep is not special it is a labs project and labs project SALs at at Nova_Resource:$project/SAL [14:04:46] labs-morebots handles sending entries there [14:04:47] I am a logbot running on tools-exec-1220. [14:04:47] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:04:47] To log a message, type !log . 
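A rough sketch of the pre-reboot triage that the salt suggestion above was aiming at, run from the project saltmaster named in the log; the exact commands are illustrative rather than any agreed procedure, and on a box whose I/O is wedged they will usually time out just like the 'id' call did:

    # from deployment-salt02.deployment-prep.eqiad.wmflabs (the deployment-prep saltmaster)
    sudo salt -t 10 'deployment-elastic06.deployment-prep.eqiad.wmflabs' cmd.run 'uptime; cat /proc/loadavg'
    sudo salt -t 10 'deployment-elastic06.deployment-prep.eqiad.wmflabs' cmd.run 'dmesg | tail -n 30'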
[14:07:53] oh yeah, the other problem is that the bot usually in -releng is not currently running. [14:08:00] not sure why [14:10:19] ok, so I'm confused, but it seems to be normal :) [15:04:35] (03PS3) 10Paladox: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) [15:07:16] (03PS4) 10Paladox: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) [15:19:54] chasemp: https://gerrit.wikimedia.org/r/#/c/302450/ created. I added you and thcipriani for review, but I think you gave me a third person... (I closed the hangout too fast...) [15:25:03] chasemp: disregard, paravoid_ has already merged it... [15:25:10] gehel: yep cool [15:25:50] PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:35:27] PROBLEM - Puppet run on tools-flannel-etcd-03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [15:36:33] (03CR) 10Zppix: [C: 031] "+1 for the idea and it doesnt blow up gerrit so i guess it works :P" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: 10Paladox) [15:37:53] PROBLEM - Puppet run on tools-elastic-03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:38:13] PROBLEM - Puppet run on tools-flannel-etcd-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:38:54] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [15:40:44] PROBLEM - Puppet run on tools-proxy-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [15:41:07] Puppet is broken in labs please see -operations [15:42:42] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:45:14] PROBLEM - Puppet run on tools-logs-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [15:45:20] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:45:34] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [15:45:50] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [15:47:17] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [15:50:57] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [15:51:48] o.O [15:55:04] PROBLEM - Puppet run on tools-elastic-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [15:58:17] o/ Grafana for labs seems to not be working. [15:58:25] It looks like I'm running into this issue: https://github.com/grafana/grafana/issues/4499 [15:59:13] It seems like everyone is saying "It's CORS" [15:59:16] Could it be CORS? [16:00:34] PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:05:12] halfak: minor emergency here but can you elaborate on what not working means specifically? [16:05:21] site down, some error, some action missing [16:09:54] what's the site down? 
[16:10:25] grafana issues but I'm unsure where or what is all [16:20:00] chasemp, sorry I was AFK. For all panes, I'm getting "Cannot read property 'message' of null" [16:20:14] This is new as of a week or two ago. So it isn't urgent [16:21:03] This only happens for the labs graphite datasource. [16:21:09] So I think that's where the issue lies. [16:22:42] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:24:36] halfak: is thsi https://grafana-labs.wikimedia.org/? [16:25:14] RECOVERY - Puppet run on tools-logs-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:14] Oh. No, I've been using the grafana datasource in grafana.wikimedia.org [16:25:20] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:27] That's what I was directed to do a couple months ago. :) [16:25:36] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:38] E.g. https://grafana.wikimedia.org/dashboard/db/ores-labs [16:25:44] yeah I thikn yuvi and godog (?) changed it out from under you [16:25:50] RECOVERY - Puppet run on tools-elastic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:25:51] but honestly I'm not entirely positive what the intention is atm [16:26:11] Boo. Still, that datasource should still work from grafana.wikimedia.org, right? [16:26:26] halfak: https://phabricator.wikimedia.org/T120295#2492682 [16:26:29] * halfak is not a fan of needing to change domains [16:26:49] Arg. [16:26:51] I only recall this in passing so you'll have to read up on that task w/ me on current idea [16:27:14] afaik the source hasn't been removed from production grafana [16:27:16] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:27:30] the graphite labs source that is [16:27:46] Gotcha. Thanks. So it seems that this was a change that rendered a bunch of dashboards unusable and now I need to go digging for all of the places that we linked to the grafana.wikimedia.org and change it to grafana-labs.wikimedia.org. [16:27:50] looks like the url needs changed. the js is doing a post that is getting a 302 response to the new domain [16:28:01] godog, but it seems to not be working. [16:28:25] If that could be repaired, I'd appreciate keeping the dashboard where it is. [16:28:31] But if I need to move, then I need to move. [16:28:47] bd808: is it pointed at labmon instead of graphite-labs? [16:28:52] halfak: ^ I wonder [16:28:55] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:29:08] or just graphite.wmflabs.org I guess [16:29:13] and now it's graphite-labs.wikimedia.org [16:29:26] yeah [16:29:38] there is a redir at the old domain, but that doesn't work with post [16:29:51] that makes sense [16:30:03] RECOVERY - Puppet run on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [16:30:12] I think we just need to fix the "wmflabs-graphite" datasource config in grafana [16:30:23] * bd808 is trying to find the right screen [16:30:59] RECOVERY - Puppet run on tools-k8s-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [16:31:26] godog: It seems like maybe ancillary breakage from service url change fyi [16:32:18] chasemp: yup, thanks! 
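A quick way to see the failure mode bd808 describes, assuming the old graphite.wmflabs.org name still answers /render with a 302 to the new domain: curl, like the browser XHR the datasource uses, drops the POST when it follows a redirect unless told not to.

    # the POSTed render query only gets a redirect back from the old name
    curl -si -d 'target=test&format=json' 'https://graphite.wmflabs.org/render' | head -n 3
    # following the 302 turns the request into a bodyless GET; --post302 keeps the method and body
    curl -sL --post302 -d 'target=test&format=json' 'https://graphite.wmflabs.org/render'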
[16:32:42] :(( I get permission denied from https://grafana-admin.wikimedia.org/datasources [16:33:01] not sure what ldap group has the rights to change things there [16:33:13] PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:33:43] yeah I get the same [16:34:12] huh [16:34:18] yeah [16:35:02] I wonder if there is a special non-ldap user with the "admin" role? [16:35:08] PROBLEM - Puppet run on tools-exec-1206 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:35:13] * bd808 hasn't messed with grafana much [16:35:16] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:35:17] mhh https://grafana-admin.wikimedia.org/admin/orgs/edit/1 sez a bunch of people have admin [16:35:22] PROBLEM - Puppet run on tools-exec-1202 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:35:50] RECOVERY - Puppet run on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [16:36:09] all of this btw is blocked on a ldap overlay to be able to shove grafana's idea of the world into our ldap [16:37:11] halfak: have you opened a ticket? We can at least document what needs changing [16:37:33] bd808, will make one now. [16:37:47] graphite.wmflabs.org -> graphite-labs.wikimedia.org in https://grafana-admin.wikimedia.org/datasources [16:38:33] how many ops,devs, and research scientists does it take to change a grafana setting :) [16:38:39] I feel a joke coming on here but I'm not creative enough [16:38:45] ha. [16:38:53] halfak: btw in the meantime since the dashboard is broken anyway, it can be moved to grafana-labs fairly easily by downloading the json from https://grafana.wikimedia.org/api/dashboards/db/ores-labs / change the datasource and import the json into grafana-labs [16:39:20] godog: ah nice I didn't realize you could bulk export like that [16:39:22] PROBLEM - Puppet run on tools-docker-test-05 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:39:29] godog, yeah. Familiar with that transfer pattern. However, I'd have to change every link on the wiki to that too [16:39:42] Which would be fine if I'm only doing it once. [16:39:48] Although this is already the second move. 
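For the move godog suggests, something along these lines works from any machine with curl and jq; the datasource names are assumptions (the log only mentions "wmflabs-graphite" on the production side), so check what the export really contains before editing it.

    # pull the dashboard definition out of production grafana
    curl -s 'https://grafana.wikimedia.org/api/dashboards/db/ores-labs' > ores-labs.json
    # list the datasource names the panels reference (null means the default source)
    jq '[.dashboard.rows[].panels[].datasource] | unique' ores-labs.json
    # point the panels at whatever the labs graphite source is called on grafana-labs,
    # then import the edited "dashboard" object there via Dashboards -> Import
    sed 's/"datasource": "wmflabs-graphite"/"datasource": "graphite-labs"/' ores-labs.json > ores-labs-labs.json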
[16:39:49] :) [16:39:52] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:40:14] PROBLEM - Puppet run on tools-exec-1407 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:40:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:40:24] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:40:26] PROBLEM - Puppet run on tools-exec-1210 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:40:27] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:40:30] PROBLEM - Puppet run on tools-exec-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:40:36] RECOVERY - Puppet run on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0] [16:40:37] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:40:37] heh, yeah I don't think grafana supports redirects for dashboards, what I did previously is add a new panel with the new url [16:40:38] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:40:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1405 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:41:03] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:41:03] it would be a nice feature tho [16:41:20] 06Labs, 10Labs-Infrastructure, 10Graphite: Can't use wmflabs graphite datasource in grafana.wikimedia.org - https://phabricator.wikimedia.org/T141891#2515780 (10Halfak) [16:41:23] https://phabricator.wikimedia.org/T141891 [16:41:47] PROBLEM - Puppet run on tools-docker-test-04 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:42:07] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:42:29] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:42:37] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:42:39] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:42:45] thanks halfak sorry for leading you down a wrong path in the beginning :) taht should be a solid url from on out [16:43:03] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:43:11] OK. Do you advice that we make a switch now chasemp? [16:43:43] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:44:04] As I understand it graphite-labs.wikimedia.org is here to stay and we should fight to death over it from here on out [16:44:07] if I'm you I guess I would [16:44:15] Something's wrong with a query - https://quarry.wmflabs.org/query/6012 [16:44:28] Cool will do chasemp [16:44:29] It should be dropping anything that alreay has a rationale and isn't [16:44:34] Suggestions? 
[16:44:45] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:44:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:45:13] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:45:14] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:45:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:45:26] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:45:28] chasemp, I should be using my wikitech login info at graphite-labs.wikimedia.org, right? [16:45:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:45:44] shinken-wm, I'd ask Yuvi in a few hours [16:45:47] ShakespeareFan00, ^ [16:45:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:45:52] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:45:59] halfak: yeah [16:46:09] I think it's that my query is wrong [16:46:12] Hmmm... not working. Let me try some sanity checks [16:46:18] PROBLEM - Puppet run on tools-logs-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:46:20] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:46:30] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:46:34] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:46:38] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:46:38] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:46:43] ShakespeareFan00: jsut glanced, but how does an IN where cond exclude something? 
[16:46:50] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:46:56] huh I can't update the topic [16:47:00] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:47:01] bd808: It worked previously [16:47:11] If you can design the SQL query so it works [16:47:12] PROBLEM - Puppet run on tools-worker-1016 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:47:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:47:36] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:47:40] PROBLEM - Puppet run on tools-web-static-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:47:43] chasemp: I think the channel perms got locked down due to the spammer the other day [16:47:50] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:47:51] well bunk [16:47:58] What i want to do is exclude the listed template links [16:47:59] bd808: can you update Normal to Upgrade in Progress? [16:48:02] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [16:48:08] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:48:16] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:48:19] chasemp: :( I have no rights that I know of [16:48:41] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:48:43] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:49:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:49:22] ShakespeareFan00, your query worked fine on the replicas themselevs [16:49:24] themselves [16:49:26] I checked that [16:49:32] Odd. [16:49:47] Otherwise I wouldn't've told you to wait for yuvi :) [16:50:11] PROBLEM - Puppet run on tools-webgrid-generic-1405 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:50:14] Which queires? [16:50:15] Krenair: do you have rights to change topic here (channel is now +t) [16:50:17] yeah. I have confirmed that I can't log into graphite-labs.wikimedia.org with the same creds that work at wikitech.wikimedia.org [16:50:18] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:50:26] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:50:28] The ones on the phab ticket or the NFUR finding ones mentioned in IRC today? [16:50:30] * halfak can wait until the issues are resolved. [16:50:40] PROBLEM - Puppet run on tools-exec-cyberbot is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:50:42] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:50:42] bd808, no. 
I have no special IRC rights outside of #mediawiki [16:50:53] k [16:51:00] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:51:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:51:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:51:11] Yuvi and And.rew are ops in here [16:51:32] it's cool no biggie, I'll add changing topic here to our big changes check list [16:51:37] and we'll have to dole out some perms [16:51:39] as are Coren, Ryan and Sumana [16:51:49] BOther [16:51:51] and jeremyb [16:51:51] MY fault [16:51:56] PROBLEM - Puppet run on tools-worker-1006 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:51:57] really ShakespeareFan00? [16:51:58] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:51:58] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:52:04] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:52:05] Forget a WHERE clause and the synatx checker didn't tell me ;) [16:52:18] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:52:22] ShakespeareFan00, your results have just appeared [16:52:22] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:52:27] in quarry [16:52:28] Yeah [16:52:30] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:52:31] maybe it took a while to load? 
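On chasemp's question above about how an IN condition excludes anything: it doesn't on its own; the usual shape for "pages that do not carry any of the listed templates" is an anti-join (LEFT JOIN ... IS NULL). A sketch against the enwiki replica, with a made-up template name standing in for whatever the NFUR queries actually target:

    mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb enwiki_p -e "
      SELECT p.page_title
        FROM page p
        LEFT JOIN templatelinks tl
          ON tl.tl_from = p.page_id
         AND tl.tl_namespace = 10
         AND tl.tl_title IN ('Non-free_use_rationale')   -- placeholder template name(s)
       WHERE p.page_namespace = 6                        -- File: pages
         AND tl.tl_from IS NULL                          -- i.e. none of the listed templates present
       LIMIT 100;"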
[16:52:34] Possibly [16:52:39] PROBLEM - Puppet run on tools-worker-1021 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:53:05] THE NFUR Finding queries were badly written [16:53:15] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:53:24] I'd left out a WHERE meaning ti was querying something other than what I thought it was ;) [16:53:39] Always a logical error to check for :( [16:53:41] (sigh) [16:53:49] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:54:03] PROBLEM - Puppet run on tools-docker-test-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:54:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:54:27] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:54:47] PROBLEM - Puppet run on tools-worker-1010 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:54:56] Krenair : Thanks [16:55:07] ShakespeareFan00, I didn't do anything, you're welcome :) [16:55:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [16:55:17] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [16:55:18] Thanks for being patient [16:55:19] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:55:21] heh [16:55:31] And for looking into the phab ticket I raised :) [16:55:31] PROBLEM - Puppet run on tools-redis-1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:55:39] PROBLEM - Puppet run on tools-web-static-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [16:55:57] PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:55:59] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [16:56:03] PROBLEM - Puppet run on tools-elastic-02 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [16:56:10] https://quarry.wmflabs.org/query/6052 - Still seeing phantoms here... 
(but already reported these) [16:56:38] PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:56:52] PROBLEM - Puppet run on tools-proxy-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [16:56:54] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [16:57:03] PROBLEM - Puppet run on tools-checker-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:57:25] PROBLEM - Puppet run on tools-worker-1022 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:57:41] PROBLEM - Puppet run on tools-docker-registry-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [16:57:51] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:57:55] PROBLEM - Puppet run on tools-exec-1204 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:57:59] PROBLEM - Puppet run on tools-exec-1219 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [16:58:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [16:58:47] bd808, okay so it turns out I have a lot more IRC permissions than I thought I did: op in mediawiki, mediawiki-core, mediawiki-feed, mediawiki-visualeditor, wikimedia-editing, wikimedia-staff, and wikimedia-codereview - how do I accumulate all these things? [16:58:51] PROBLEM - Puppet run on tools-exec-1405 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [16:59:15] PROBLEM - Puppet run on tools-worker-1019 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:59:33] nothing in here, -tech, -operations, -dev or anywhere more commonly useful though [17:00:14] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:00:16] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0] [17:00:28] PROBLEM - Puppet run on tools-exec-1211 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:00:30] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:01:05] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:01:05] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:01:32] should the /topic be changed until after the upgrade? :) [17:01:37] PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:01:53] Krenair: Something that doesn't yet exist on Quarry is the ability to let other people 'update' your query but not change it [17:02:03] (I.e a "Refresh" button...) [17:02:07] ShakespeareFan00, refresh results? 
[17:02:20] Depending on the query size , yes [17:02:23] could open a feature suggestion ticket [17:02:26] PROBLEM - Puppet run on tools-exec-1404 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:02:34] Krenair: Will consider that [17:02:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:02:46] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:02:49] Most of the queries I've written are replacemennts for ones I used to use at WP:DBR [17:03:03] or CATSCAN [17:04:11] Also another feature I might suggest is to limit query to a subset of the database, like pages that are known to have been edited in the last 24hours or so [17:04:44] I'm not sure if the replication can do 'journaling' like that though [17:04:49] incoming RECOVERY storm I think [17:05:18] ShakespeareFan00, um, that sounds like you want it writing SQL for you? [17:05:36] Quarry takes SQL and executes it on the replica server for you... [17:05:38] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:05:40] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:06:04] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:06:19] Krenair: I'm not sure it's possible to do a query based on timestamps directly from the page table? [17:06:32] If it is, then it just another WHERE clause [17:07:02] ShakespeareFan00, you want the page_touched field [17:07:07] Thanks [17:07:07] I think [17:07:27] or I suppose you could join revision [17:07:41] That too [17:08:13] Searching revisions though EATS server time :) [17:11:16] revision_userindex? probably not useful for this [17:12:23] Guest18760: want to identify, yuvi? :D [17:12:25] Guest18760: It was about limitiing a query to a specifc time range [17:12:53] RECOVERY - Puppet run on tools-elastic-03 is OK: OK: Less than 1.00% above the threshold [0.0] [17:13:03] This could be done straight in the SQL using a suitable clause... or alterntively there could be subsstes of the db that could used... [17:13:45] Running through the entire db of a project to find 5 or 6 edits in the last 24 hours eeemed to be overkill [17:13:51] *seemed [17:14:01] oh there you are [17:14:16] ShakespeareFan00, what do you mean by subsets of the DB? 
[17:14:45] Krenair: A Smaller set of data representing activity over say the last 24 hours or so [17:14:51] because it sounds to me like you want to reimplement half of SQL and/or mysql [17:14:59] Ah OK [17:15:10] I obviously don't understand how SQL works :( [17:15:10] RECOVERY - Puppet run on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:14] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:15] RECOVERY - Puppet run on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:22] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:25] Quarry takes your SQL query and gives it to MySQL [17:15:26] RECOVERY - Puppet run on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:27] RECOVERY - Puppet run on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:30] RECOVERY - Puppet run on tools-exec-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:36] and MY SQL optomises the query? [17:15:41] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:43] RECOVERY - Puppet run on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:15:43] (using indexing or something)? [17:15:46] yes [17:15:49] RECOVERY - Puppet run on tools-webgrid-lighttpd-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [17:16:01] ah :D [17:16:02] OKay [17:16:04] Sorry [17:16:06] that's better [17:16:19] yuvipanda: Do you read phab-tickets? [17:16:30] halfak godog I didn't know grafana used POSTs for dashboards. [17:16:39] I just woke up, so not yet [17:17:03] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [17:17:17] I setup redirects to not break it [17:17:31] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:17:37] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0] [17:17:39] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:17:48] godog is there a way to have grafana use GETs? [17:17:49] Krenair: Sorry [17:18:01] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:18:07] (I must sound clueless at times.) [17:18:15] RECOVERY - Puppet run on tools-flannel-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:18:25] for what? 
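To make the time-range idea above concrete: page_touched and rev_timestamp are both plain MediaWiki timestamps (YYYYMMDDHHMMSS), so the "last 24 hours" subset really is just another WHERE clause, and EXPLAIN shows whether MySQL can satisfy it from an index. A sketch against the enwiki replica; note page_touched also moves on purges and template edits, so the revision join is the stricter test of "actually edited":

    mysql --defaults-file="$HOME/replica.my.cnf" -h enwiki.labsdb enwiki_p -e "
      SET @cutoff = DATE_FORMAT(NOW() - INTERVAL 1 DAY, '%Y%m%d%H%i%s');
      -- pages touched in the last day (includes purges and template edits)
      SELECT page_namespace, page_title FROM page
       WHERE page_touched > @cutoff LIMIT 50;
      -- pages actually edited in the last day
      SELECT DISTINCT page_namespace, page_title
        FROM revision JOIN page ON rev_page = page_id
       WHERE rev_timestamp > @cutoff LIMIT 50;
      -- 'type: ALL' in the output would mean a full scan, which is what eats server time
      EXPLAIN SELECT page_id FROM page
       WHERE page_namespace = 6 AND page_title = 'Example.jpg';"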
[17:18:32] ok, possible labs network interruption coming up [17:18:53] Krenair: I'm sorry for sounding clueless [17:19:11] I don't mind [17:19:15] and coming up with ideas that anyone that actually understood stuff wouldn't ask about [17:19:23] RECOVERY - Puppet run on tools-docker-test-05 is OK: OK: Less than 1.00% above the threshold [0.0] [17:19:24] yuvipanda: I bet the POST is built in due to worries about url length [17:19:45] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [17:19:49] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:19:51] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:19:57] yuvipanda: no idea :( [17:20:14] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [17:20:24] yuvipanda: I filed a phab-ticket about this - https://quarry.wmflabs.org/query/6052 [17:20:27] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0] [17:20:31] I'm seeing "phantoms" [17:20:35] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:03] RECOVERY - Puppet run on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:11] (03CR) 10Yuvipanda: [C: 032] Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: 10Paladox) [17:21:21] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:29] RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:33] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:38] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:38] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0] [17:21:48] RECOVERY - Puppet run on tools-docker-test-04 is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:00] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:10] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:34] I'm going to kill shinken-wm [17:22:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [17:22:50] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:06] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:40] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:41] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:42] yuvipanda, could just mute it in here [17:33:02] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0] [17:33:16] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [17:33:48] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [17:34:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [17:34:22] RECOVERY - Puppet run on 
tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [17:34:47] RECOVERY - Puppet run on tools-worker-1010 is OK: OK: Less than 1.00% above the threshold [0.0] [17:35:07] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [17:35:17] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [17:36:04] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [17:36:05] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [17:39:56] ShakespeareFan00 which ticket? I assume this is labsdb related, not entirely sure what I can do to help :( [17:40:13] https://phabricator.wikimedia.org/T141818 [17:41:28] 06Labs, 10Quarry, 10DBA: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2516050 (10yuvipanda) [17:41:35] I've added the DBA tag, I guess the DBA will take a look at it when they can. [17:46:31] 06Labs, 10Quarry, 10DBA: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2513270 (10jcrespo) Please follow the recommendations mentioned at: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database/Replica_drift [17:47:54] 06Labs, 10Quarry, 10DBA: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2516093 (10jcrespo) [17:47:58] 06Labs, 10DBA: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2516091 (10jcrespo) [17:50:11] 06Labs, 10Quarry, 10DBA: Phantom entries in "Quarry" query and or labs replica of enwiki db - https://phabricator.wikimedia.org/T141818#2516102 (10ShakespeareFan00) [17:50:15] 06Labs, 10DBA: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2516103 (10ShakespeareFan00) [17:52:02] 06Labs, 10DBA: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2415416 (10ShakespeareFan00) Notifying here concerning https://phabricator.wikimedia.org/T141818 [17:56:38] 06Labs, 10DBA: Labs database replica drift - https://phabricator.wikimedia.org/T138967#2516123 (10jcrespo) May I ask you to please copy the query here to check the drift? We keep here the older subtasks only because they are older than this one, but following a single format will help speed up the resolution o... [18:32:31] milimetric meeting? [18:33:21] is the labs upgrade done done? [18:38:55] milimetric lost video, refreshing [19:06:02] PROBLEM - SSH on tools-merlbot-proxy is CRITICAL: Server answer [19:06:35] (03Merged) 10jenkins-bot: Fix this so it correctly says who the user is changing the patch [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302416 (https://phabricator.wikimedia.org/T141329) (owner: 10Paladox) [19:08:57] 06Labs, 10Tool-Labs: /home/ilya missing replica.my.cnf - https://phabricator.wikimedia.org/T140592#2516410 (10intracer) [19:13:17] I am running http://meta.wikimedia.org/wiki/WM-Bot version wikimedia bot v. 2.8.0.0 [libirc v. 
1.0.3] my source code is licensed under GPL and located at https://github.com/benapetr/wikimedia-bot I will be very happy if you fix my bugs or implement new features [19:13:17] @help [19:14:07] We need better spam prevention rather than +t [19:14:32] +t just inconveniences everyone who isn't an op from setting the topic [19:16:22] yuvipanda: poky poke [19:17:21] matanya just ask :) [19:17:29] pm you [19:17:30] (I'm going afk shortly, but others might be able to help you) [19:18:17] yuvipanda could you deploy https://gerrit.wikimedia.org/r/302416 please [19:18:19] it merged [19:18:20] ? [19:22:05] 06Labs, 10Tool-Labs: Write diamond collector for gridengine job count stats - https://phabricator.wikimedia.org/T140999#2516497 (10yuvipanda) This doesn't seem to be run by diamond properly, I'll take a look in a bit. [19:23:05] paladox I have to go somehwere to get my laptop charger back just now, I'll deploy in a few hours! [19:23:19] Ok [19:23:21] thanks [19:24:17] andrewbogott: planning on a rights review for wikitech at some point ? [19:27:07] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2516513 (10chasemp) [19:30:17] !log git adding Alex Monk (Krenair) to project [19:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Git/SAL, Master [19:31:02] RECOVERY - SSH on tools-merlbot-proxy is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [19:35:19] PROBLEM - Puppet staleness on tools-merlbot-proxy is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [43200.0] [19:40:21] RECOVERY - Puppet staleness on tools-merlbot-proxy is OK: OK: Less than 1.00% above the threshold [3600.0] [19:49:06] matanya: do you have something specific in mind? [19:50:32] andrewbogott: https://wikitech.wikimedia.org/wiki/Special:ListUsers/sysop [19:52:24] Eloquence, LVilla (WMF), Sumanah and WikiSysop? [19:52:38] and Coren? [19:59:37] yes, that [20:31:28] what is 'WikiSysop'? [20:32:40] andrewbogott: https://wikitech.wikimedia.org/wiki/OpenStack#Create_a_wiki_user_for_yourself.2C_and_promote_it_using_WikiSysop? a user name, I think [20:33:44] yeah, so it's a special user for bootstrapping… not obvious to me that it shouldn't have admin rights [20:34:46] I think it's historical [20:35:56] valhallasw`cloud Hi im wondering if this will fix the grrrit-bot https://gerrit.wikimedia.org/r/#/c/302416/ [20:35:59] from before my time [20:36:10] Krenair: mine too apparently [20:36:14] paladox: yuvipanda already merged that? [20:36:31] Yeh [20:36:46] But hasent been deployed [20:36:47] yet [20:36:51] * Krenair is digging through core history [20:36:55] im just wondering did i do the correct fix. [20:37:09] valhallasw`cloud ^^ [20:37:10] paladox: I have no clue. [20:37:13] Ok [20:37:45] valhallasw`cloud im wondering could you deploy it please [20:37:45] ? [20:37:56] No. [20:38:03] ok [20:38:35] deploying and possibly reverting grrrit-wm is not something I'm comfortable with. [20:42:40] https://www.mediawiki.org/wiki/Special:Code/MediaWiki/69518 [20:43:01] it was also used in some old parser tests and things [20:43:23] probably comes from the original wikitech install [20:44:09] Oh [20:44:12] ok [20:45:52] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [21:20:52] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:32:24] I'm having issues with routing for ORES in labs. 
[21:32:29] I think [21:32:33] https://ores.wmflabs.org/ times out [21:32:47] But if I log into our web nodes and "wget localhost:8080" everything is working fine. [21:33:00] I see that tools.wmflabs.org is up [21:33:04] SO it's not fully proxy [21:34:03] tools.wmflabs.org isn't using the general proxy mechanism [21:34:04] Check quarry? [21:34:07] I'll be at a laptop in a few mins [21:34:18] Ahh yeah. Same story for quarry [21:34:24] chasemp, ^ [21:34:44] Looks like the DNS proxy might have fallen over [21:35:22] hm andrewbogott about^? [21:35:41] novaproxy-01 [21:36:22] yuvipanda: yeah it seems novaproxy is out atm [21:36:32] Is the instance up? [21:36:45] (am 3min from office now) [21:37:20] Looking [21:37:30] I tihnk the instance may be down [21:38:01] I can ping the instance [21:38:05] Console log? [21:38:07] I can't ssh anyway [21:38:12] I wonder if it got hit by the freezes [21:38:35] yep [21:38:36] 'the instance' is the proxy instance or just halfak's instance? [21:38:36] Salt? [21:38:38] seems like similar issue [21:38:40] proxy andrewbogott [21:38:48] andrewbogott: seems novaproxy-01 wigged out [21:38:52] I'm going to reboot [21:38:59] chasemp, check salt first [21:39:02] we have a seemingly serious io issue [21:39:05] check salt for what? [21:39:08] chasemp: ok, that's what I would do [21:39:14] to see if you can get in that way and find the issue [21:39:37] salt is a pretty useless diagnostic tool here, I'm going to reboot to restore the outage and then look at it [21:39:42] but best thing we can do is get the console going [21:39:57] fine, but salt will let you run commands on the instance [21:40:03] unless it's really dead [21:40:13] sure but what commands do you want to see? [21:40:33] I'm not being obtuse here I have no clue what salt things you are thinking [21:40:44] and atm the outage outways any I can think of [21:40:50] just reboot it [21:40:52] ok am on [21:41:07] it's rebooting [21:41:09] ok [21:41:27] krenair 100% of the time with instance freezes salt has been useless, and I've had a max of maybe 3-4 times ever salt has been useful. [21:42:06] is the instance frozen if you can ping it? [21:42:19] yes because it doesn't do anything else useful. [21:42:35] labvirt1001.eqiad.wmnet too [21:42:52] right [21:43:00] well, frozen is a pretty useless and broad idea [21:43:06] nodes are failing at various levels of io issues (possibly) [21:43:13] but [21:43:16] [11409409.151647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21:43:16] [11409409.152879] INFO: task cron:14373 blocked for more than 120 seconds. [21:43:17] chasemp did you see what was on teh console log before you rebooted? [21:43:21] yes [21:43:23] right [21:43:46] https://tools.wmflabs.org/nagf/?project=project-proxy you can also see it stops sending stuff to graphite [21:43:57] [11409409.150117] INFO: task cron:14372 blocked for more than 120 seconds. [21:43:57] [11409409.151003] Not tainted 3.16.0-4-amd64 #1 [21:43:58] [11409409.151647] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21:43:58] [11409409.152879] INFO: task cron:14373 blocked for more than 120 seconds. [21:43:58] [11409409.153825] Not tainted 3.16.0-4-amd64 #1 [21:43:58] [11409409.155729] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21:43:59] [11409409.156902] INFO: task cron:14374 blocked for more than 120 seconds. 
[21:43:59] [11409409.157905] Not tainted 3.16.0-4-amd64 #1 [21:43:59] [11409409.159943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [21:44:00] a cron job locked up [21:44:00] conceivably on io [21:44:00] let me look at io [21:44:14] that's cron itself no, rather than a cronjob [21:44:42] I see no io spikes [21:44:53] https://graphite-labs.wikimedia.org/render/?width=588&height=310&_salt=1470174276.01&target=project-proxy.novaproxy-01.iostat.vda.io [21:45:25] halfak it should be back now [21:45:33] nothing that means anything to me [21:45:33] Services are recovering. [21:45:36] yeah [21:45:36] Thanks folks [21:45:45] np thanks for bringing it to our attention, halfak [21:45:48] :D [21:46:03] so thoroughly confusing, but we seem to have a resource contention issue(s) [21:47:14] possibly. [21:47:24] * yuvipanda looks at ganglia for labvirt servers [21:48:27] !log tools.stashbot Updated to 21ca5f1 [21:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master [21:49:09] can't really see anything useful there [21:49:30] yeah, it's an artificial contention I think [21:50:20] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2516967 (10yuvipanda) [21:50:26] ok, I updated ^ [21:52:31] 21:42:50 I see procs go into d-wait [21:52:53] 532 root 1 0.00s 0.00s 25804K 2916K 1528K 0K N- - D 0 2% nginx [21:52:53] 533 root 1 0.00s 0.00s 20172K 3396K 1376K 0K N- - D 1 2% salt-minion [21:52:54] 179 root 1 0.03s 0.00s 34376K 3680K 1196K 0K N- - S 0 1% systemd-udevd [21:52:55] 181 root 1 0.00s 0.00s 34376K 3644K 1124K 0K N- - S 0 1% systemd-udevd [21:52:57] 595 root 1 0.00s 0.01s 16568K 4288K 904K 0K N- - R 1 1% atop [21:52:59] 159 root 1 0.04s 0.07s 32968K 3180K 788K 0K N- - S 1 1% systemd-journa [21:53:01] 178 root 1 0.00s 0.00s 34744K 4068K 780K 0K N- - S 1 1% systemd-udevd [21:53:03] 180 root 1 0.00s 0.00s 34376K 3448K 780K 0K N- - S 1 1% systemd-udevd [21:53:05] 598 root 1 0.00s 0.00s 21032K 3012K 780K 0K N- - D 1 1% ntpd [21:53:07] 534 diamond 1 0.00s 0.00s 20172K 3420K 756K 0K N- - D 0 1% diamond [21:53:09] 549 root 1 0.00s 0.00s 39744K 2076K 744K 0K N- - D 0 1% dbus-daemon [21:53:37] chasemp pastebin :P [21:53:39] it's all garbled... [21:53:57] then again it jumps from 21:20 to 21:42 [21:54:07] missed teh 21:30 collection [21:55:24] chasemp is that atop output? [21:55:28] it is [21:55:39] also can you put that in a pastebin so I can read it? 
[21:55:48] pasting things onto IRC directly almost always garbles them and loses tabs / spaces / etc [21:56:12] var/log/atop# atop -r atop_20160802 -b 20:00 [21:56:15] sure [21:56:17] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2517006 (10yuvipanda) [21:56:19] ah, thanks [21:56:40] "t" to jump to next collection [21:57:18] cron indeed spiked in io tho [21:58:37] not much of a spike but still either coincidentally it was caught or the surge itself triggered [21:59:55] yeah seems like not a spike just a recognition of it's blocking state maybe [22:00:27] I also see [22:00:34] super high ksoftirqd usage [22:00:38] which I was seeing in a lot of places [22:00:47] > 3 root 20 0 0 0 0 R 5.3 0.0 0:26.62 ksoftirqd/0 [22:00:58] it's hard to know what to make of that, cause or effect [22:01:03] 4-5% constant usage [22:01:12] well, upgrading kernel got rid of it on most of the worker nodes [22:01:22] and in talking to moritzm the next step for us was to install irqbalance if it came back [22:01:47] novaproxy-01 is on a old kernel, so I'm tempted to upgrade it, but that's *maybe* unrelated to what happened to it, but I'm not entirely sure [22:01:58] has this happened on any nodes where we upgrade the kernel and we see ksoftirqd usage drop? [22:02:08] I thought it had [22:02:22] it's jessie also [22:02:27] chasemp can you rephrase? I don't understand what that sentence meant [22:02:28] which is even more interesting then [22:02:33] what does 'this' refer to? [22:03:00] lockup, asking if we have seen the lockup issue post-kernel update where ksoftirqd usage drops [22:03:19] yup [22:03:27] so I think they're unrelated [22:03:31] yeah [22:03:34] so probably shouldn't do it now, confounding effects. [22:03:57] one thing k8s has in common with these nodes and uncommon w/ the rest of tools is [22:03:58] jessie [22:04:10] the thread seems to carry through and I don't knwo what to make of it yet [22:04:34] so deployment-logstash2 I wonder, if it was jessie or trusty [22:04:39] since it was stuck with very similar symptoms [22:04:40] looking [22:04:56] it is jessie [22:05:32] yeah [22:06:09] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2517136 (10yuvipanda) [22:06:11] * yuvipanda adds 'OS' to the table [22:06:11] So maybe the common thing isn't k8s (as we puzzled over) but instead Jessie [22:07:14] so we need console, and to catch this in teh act, and to loop in one of the debian guys [22:07:17] atm [22:07:46] unless we can come up with a non-jessie outlier and not that it's scientific but sure seems consistent [22:09:14] yah [22:10:10] chasemp so next step now is to just get console access working? [22:10:19] I wish google had a "show me results from last year" [22:10:33] chasemp can you write out your thinking on https://phabricator.wikimedia.org/T141673? I'll figure out a way to make sure we have appropriate monitoring for the novaproxy [22:12:19] yuvipanda: I will yep, kind of in an awkward place atm and need to close up but I'm making my notes and will update first thing am or when i can come back [22:12:27] also doing a bit of digging before [22:12:33] chasemp cool. [22:12:43] I'll dig into monitoring [22:14:10] holy crap there is a lot of logging to qemu [22:15:18] oh? [22:15:42] nothing useful it seems [22:17:30] yuvipanda, andrewbogott, would it be OK to have Sigyn in here? It's a bot from freenode that k lines spammers. (Need to get approval from chan ops to have it in here) [22:18:06] paladox I deployed your change, test? 
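For reference, the diagnostics being discussed above written out as commands; the console step assumes access to the OpenStack tooling on the control host, the rest runs on the instance once it is reachable again, and none of this is an agreed procedure, just a sketch of the same steps.

    # guest console output when ssh/salt are wedged (instance name from this incident)
    openstack console log show novaproxy-01 | tail -n 50
    # replay the atop history from around the stall, as chasemp did
    atop -r /var/log/atop/atop_20160802 -b 21:40
    # list anything currently stuck in uninterruptible sleep (state D)
    ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'
    # with root, the kernel stack shows what a given D-state task is blocked on
    sudo cat /proc/<pid>/stack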
[22:18:12] ok thanks [22:18:22] tom29739 assuming that's all it does, sure! who runs it? [22:18:38] yuvipanda could you restart the grrrit-wm bot please [22:18:40] yuvipanda: freenode [22:18:43] since gerrit was restarted [22:18:45] but not the bot [22:18:50] please [22:18:51] > [22:18:52] ? [22:19:00] paladox I just restarted the bot [22:19:05] Oh [22:19:07] ok [22:19:09] thanks [22:19:28] It seems the bot left [22:19:29] ? [22:19:32] wait gah [22:19:35] I just restarted it too [22:19:37] sorry [22:19:38] Oh [22:20:11] !log tools.lolrrit-wm re-re-started grrrit-wm [22:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [22:20:17] No it dosent seemed to have worked [22:20:27] paladox: it came back [22:20:33] Yep [22:21:50] thanks legoktm :) [22:22:46] yuvipanda it seems the bot is not working [22:22:47] any more [22:23:10] i tested with comment and uploading patches, ive only touched uploading patches part not comments [22:25:20] 06Labs: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2517277 (10Andrew) [22:28:59] (03Draft2) 10Paladox: Another fix so that it will tell you the correct user [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302617 [22:29:02] (03Draft1) 10Paladox: Another fix so that it will tell you the correct user [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302617 [22:31:26] paladox what's your wikitech username? [22:31:34] paladox [22:31:46] ah you aren't a memeber of the tools project yet [22:32:14] nope [22:32:38] !log tools added paladox to tools [22:32:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:32:45] thanks [22:32:52] !log tools.lolrrit-wm added paladox to tool [22:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [22:32:59] thanks [22:33:38] paladox https://wikitech.wikimedia.org/wiki/Grrrit-wm has info [22:33:48] Yep thankyou [22:34:00] paladox basically, the code is in ~/lolrrit-wm - so you can cherry pick your patch there, and restart the service based on instructions there and see how it goes [22:34:11] oh thanks [22:34:17] how do i cherry pick [22:34:24] do i just do the gerrit git checkout [22:34:43] For example like this [22:34:44] git fetch ssh://paladox@gerrit.wikimedia.org:29418/operations/puppet refs/changes/01/302601/3 && git checkout FETCH_HEAD [22:34:52] It would be exactly that command [22:34:54] but similar [22:34:56] ? [22:35:48] paladox yes, except 'git cherry-pick' instead of 'git checkout' in the end [22:35:57] Ok [22:36:03] How do i ssh into it please [22:36:04] ? [22:37:14] paladox: to get to tools it's "ssh tools-bastion-03" from the main labs bastion [22:37:22] Oh [22:37:34] Then "become " [22:37:38] Something like [22:37:38] Host gerrit-test3 [22:37:39] ProxyCommand ssh -a -W %h:%p -A paladox@bastion3.wmflabs.org [22:37:39] User paladox [22:37:42] oh [22:38:06] Or just direct ssh to login.tools.wmflabs.org [22:38:14] Ok [22:39:24] * tom29739 usually treats tools and the rest of labs as 2 different worlds and has separate shell windows [22:39:31] lol [22:40:06] tom29739 how i get into tools.lolrrit-wm [22:40:08] please [22:40:31] Never mind [22:40:33] managed it [22:40:39] become lolrrit-wm [22:40:51] That's right [22:41:04] Ok im applying it now [22:41:08] now how do i log it [22:41:10] please [22:41:11] ? 
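Putting the scattered instructions above together, the cherry-pick workflow for the tool looks roughly like the following. The refs/changes path and patchset number for change 302617 are assumptions here (Gerrit shows the exact ref on the change's download links), and ~/lolrrit-wm is the checkout named on the Grrrit-wm wiki page:

    # from the main labs bastion, or directly:
    ssh login.tools.wmflabs.org
    become lolrrit-wm
    cd ~/lolrrit-wm
    # fetch the change from Gerrit and cherry-pick it onto the tool's checkout
    git fetch ssh://paladox@gerrit.wikimedia.org:29418/labs/tools/grrrit \
        refs/changes/17/302617/2 && git cherry-pick FETCH_HEAD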
[22:41:18] !log Depooling tools-worker 1012 and 1013 for T141126 [22:41:19] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [22:41:19] Depooling is not a valid project. [22:41:20] paladox: !log [22:41:30] Oh i mean for tools [22:41:30] !log tools Depooling tools-worker 1012 and 1013 for T141126 [22:41:31] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [22:41:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:41:41] So !log tools. message [22:41:49] paladox: ^ [22:42:04] That's the normal way for a tool [22:42:08] !log tools cherry picking 302617 onto lolrrit-wm [22:42:11] Ok thanks [22:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:42:31] * tom29739 face palms [22:42:42] paladox: that's gone into the tools log [22:42:47] Oh [22:42:49] woops [22:42:51] sorry [22:42:59] !log tools.lolrrit-wm cherry picking 302617 onto lolrrit-wm [22:43:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [22:44:00] tom29739 how do i restart it [22:44:05] https://wikitech.wikimedia.org/wiki/Grrrit-wm [22:44:12] PROBLEM - Host tools-worker-1012 is DOWN: CRITICAL - Host Unreachable (10.68.16.49) [22:44:13] Do i kubectl delete pod [22:44:13] ? [22:44:43] !log tools depool tools-worker-1015 for T141126 [22:44:44] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [22:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:45:15] !log tools.lolrrit-wm restarting grrrit-wm bot. [22:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [22:45:20] PROBLEM - Host tools-worker-1013 is DOWN: CRITICAL - Host Unreachable (10.68.18.118) [22:46:19] yay [22:46:21] it joined [22:46:24] lets test [22:48:32] PROBLEM - Host tools-worker-1015 is DOWN: CRITICAL - Host Unreachable (10.68.23.37) [22:48:54] yuvipanda the bot keeps restarting [22:48:55] ? [22:49:01] paladox look at logs [22:49:07] probably means there's a bug in your code [22:49:14] Ok [22:49:32] !log tools depooled tools-worker-1014 as well for T141126 [22:49:34] T141126: Investigate moving docker to use direct-lvm devicemapper storage driver - https://phabricator.wikimedia.org/T141126 [22:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [22:51:02] yuvipanda i carnt find anything in the logs causing this [22:51:30] Im going to revert [22:51:36] and try again tomarror [22:52:15] ok! [22:52:33] yuvipanda do you mind me manually hacking it to revert my code [22:52:45] since it is getting late and will try again tomarror [22:52:53] PROBLEM - Host tools-worker-1014 is DOWN: CRITICAL - Host Unreachable (10.68.23.152) [22:54:53] (03Abandoned) 10Paladox: Another fix so that it will tell you the correct user [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/302617 (owner: 10Paladox) [22:56:08] paladox nope, feel free. [22:56:17] Thanks [22:57:12] !log tools.lolrrit-wm tempoaraily reverting two of my patches, will try and do more testing tomarror. [22:57:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.lolrrit-wm/SAL, Master [23:06:04] yuvipanda: Maybe you can ping bd808 for the meeting if he's with you? 
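The restart-and-revert dance at the end of this exchange goes through Kubernetes, since grrrit-wm runs as a pod under the lolrrit-wm tool. A sketch of the steps, with the pod name left as a placeholder because it changes on every restart (the Grrrit-wm wiki page has the authoritative instructions):

    # as the tool user (become lolrrit-wm):
    kubectl get pods                        # find the current grrrit-wm pod name
    kubectl logs <grrrit-wm-pod>            # see why the bot keeps restarting
    kubectl delete pod <grrrit-wm-pod>      # the controller respawns it with the current code
    # backing the cherry-picked commit out again before the next restart
    git -C ~/lolrrit-wm revert --no-edit HEAD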
[23:08:58] 06Labs, 10Labs-Infrastructure, 06Operations: python-designateclient package version does not match between labtestweb2001 and silver - https://phabricator.wikimedia.org/T134543#2268964 (10AlexMonk-WMF) Update: @Andrew did this during the upgrade of Labs from Kilo to Liberty. [23:11:51] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [23:51:51] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]