[00:44:43] 10Labs-project-wikistats: status of LXDE wikis - remove table? - https://phabricator.wikimedia.org/T111591#2335867 (10Dzahn) http://wiki.lxde.org/en/Main_Page looks to be up. Closing this as invalid; making a new ticket for the orain and pardus tables.
[00:46:23] 10Labs-project-wikistats: delete orain and pardus tables from wikistats - https://phabricator.wikimedia.org/T136460#2335869 (10Dzahn)
[00:46:32] 10Labs-project-wikistats: status of LXDE wikis - remove table? - https://phabricator.wikimedia.org/T111591#1611218 (10Dzahn) 05Open>03declined >>! In T111591#1683297, @RobiH wrote: > But what actually should be removed, are the ORAIN and PARDUS tables. > Both domains expired for good. -> T136460
[00:51:45] @ping
[00:51:51] @seen MusikAnimal
[00:51:52] CP678|Laptop: Last time I saw MusikAnimal they were quitting the network with reason: Quit: Cheers N/A at 5/20/2016 10:42:43 PM (7d2h9m8s ago)
[01:43:18] 10Labs-project-wikistats, 13Patch-For-Review, 07Schema-change: delete orain and pardus tables from wikistats - https://phabricator.wikimedia.org/T136460#2335916 (10Danny_B)
[01:45:18] 10Labs-project-wikistats, 13Patch-For-Review: delete orain and pardus tables from wikistats - https://phabricator.wikimedia.org/T136460#2335917 (10Dzahn)
[01:45:45] 10Labs-project-wikistats, 13Patch-For-Review: delete orain and pardus tables from wikistats - https://phabricator.wikimedia.org/T136460#2335869 (10Dzahn) No relation to "Schema-change".
[06:00:48] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/ArthurPSmith was modified, changed by BryanDavis link https://wikitech.wikimedia.org/w/index.php?diff=585923 edit summary:
[06:05:00] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Ата was modified, changed by BryanDavis link https://wikitech.wikimedia.org/w/index.php?diff=585937 edit summary:
[06:26:43] Hello?
[06:27:29] Any chance someone could help me with an issue I'm having with phabricator?
[06:48:34] 06Labs, 10Tool-Labs: toolserver-home-archive is using 52G on Tools - https://phabricator.wikimedia.org/T136202#2336035 (10Nemo_bis) The point of having the archive there is that people usually don't know far in advance when they will need something from the archive. If the file is available, they can just...
[07:06:01] Hello?
[08:34:52] PROBLEM - Host tools-bastion-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.228)
[10:07:52] my job is stuck in a 't' state
[10:07:53] 6269380 0.32841 php_transl tools.liange t 05/28/2016 10:04:01 task@tools-exec-1408.eqiad.wmf 1
[11:06:30] Is tools-db or enwiki.labsdb having a problem? ClueBot can't seem to connect to one of them (tools-db I think)
[11:22:11] RichSmith: both seem online to me.
[12:17:54] 10Tool-Labs-tools-Other, 10xTools-on-Labs: wikiviewstats webservice crashing all the time - https://phabricator.wikimedia.org/T122506#2336386 (10Mabandalone) I'm interested in contributing to fixing this. Can someone add me to the tool's group so I can have a look at the code on the server?
[13:47:22] 10Tool-Labs-tools-Other, 10xTools-on-Labs: wikiviewstats webservice crashing all the time - https://phabricator.wikimedia.org/T122506#2336557 (10MusikAnimal) The "view stats" aspect was written to go off of the old data dumps, or possibly stats.grok.se, so you'll need to rework all of that logic to use the pag...
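A note on the `t` state in the 10:07 qstat paste above: in (Son of) Grid Engine, `t` means the job is being transferred to its execution host, and it normally passes through that state within seconds; a job lingering there usually points at a sick exec node. A minimal sketch for inspecting such a job, assuming the standard SGE client tools available on Tools (the job id is the one from the log):

```
# List your own jobs with their state column
# (qw = queued/waiting, r = running, t = transferring to the exec host,
#  dr = running but marked for deletion).
qstat -u "$USER"

# Full scheduler view of one job: which exec host it was dispatched to,
# its resource requests, and any error reason.
qstat -j 6269380
```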
[14:09:02] 06Labs, 10Tool-Labs, 13Patch-For-Review: Make http (404, 302, 301 etc) statistics for toolserver.org - https://phabricator.wikimedia.org/T85167#2336568 (10Ricordisamoa) >>! In T85167#2303833, @Dzahn wrote: > https://web.archive.org/web/20120213110804/https://fisheye.toolserver.org/ > > Welcome to the Toolse...
[14:31:59] 10Tool-Labs-tools-Other, 10xTools-on-Labs: wikiviewstats webservice crashing all the time - https://phabricator.wikimedia.org/T122506#2336589 (10Mabandalone) >>! In T122506#2336557, @MusikAnimal wrote: > The "view stats" aspect was written to go off of the old data dumps, or possibly stats.grok.se, so you'll n...
[14:32:14] Hi, does someone know how to kill a job I have created (I have its id) in the grid?
[14:39:22] Kelson: qdel
[14:40:41] valhallasw`cloud: I get "job 6668316 is already in deletion", so it seems the job is not "killed" or it takes a really long time
[14:41:18] because it seems the job is still running
[14:41:23] valhallasw`cloud: $ job -v wp10-select
[14:41:23] Job 'wp10-select' has been running since 2016-05-23T19:26:31 as id 6668316
[14:42:05] Kelson: hrm. Sometimes the job doesn't get killed correctly (not sure why), and then the only solution is to ssh to the exec host and kill it there
[14:42:14] qstat shows it's running on tools-exec-1207
[14:42:33] ...which doesn't seem to be responding, which might be part of the issue
[14:44:30] 06Labs, 10Tool-Labs: tools-exec-1207 hanging - https://phabricator.wikimedia.org/T136481#2336607 (10valhallasw)
[14:45:04] valhallasw`cloud: I guess the node went out of memory and is probably frozen
[14:46:01] why do you think so?
[14:46:10] 06Labs, 10Tool-Labs: tools-exec-1207 hanging - https://phabricator.wikimedia.org/T136481#2336625 (10valhallasw)
[14:46:29] valhallasw`cloud: because I get an error in the job log about "out of memory"
[14:52:19] 06Labs, 10Tool-Labs: tools-exec-1207 hanging - https://phabricator.wikimedia.org/T136481#2336634 (10valhallasw) ``` 16:45 valhallasw`cloud: I gess the node went out-of-memoy and probably in a freeze 16:45 why do you think so? 16:46 valhallasw`cloud: because I get an error... ```
[15:12:37] Kelson: sorry, forgot to also mention it here -- I killed the job (or rather, removed it from SGE's knowledge), so you should be able to restart it
[16:15:03] 06Labs, 10Tool-Labs: tools-exec-1207 hanging - https://phabricator.wikimedia.org/T136481#2336725 (10chasemp) I would reboot this for now; fairly comfortable saying this is likely nfs maint fallout. Thanks
[16:21:20] 06Labs, 10Tool-Labs: tools-exec-1207 hanging - https://phabricator.wikimedia.org/T136481#2336726 (10valhallasw) Rebooted via wikitech; should be online again in a short while.
[16:24:20] RECOVERY - SSH on tools-exec-1207 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[16:30:40] \o/
[16:31:31] 06Labs, 10Labs-Infrastructure, 07Tracking: Labs instances sometimes freeze - https://phabricator.wikimedia.org/T124133#2336736 (10valhallasw)
[16:31:33] 06Labs, 10Tool-Labs: tools-exec-1207 hanging - https://phabricator.wikimedia.org/T136481#2336733 (10valhallasw) 05Open>03Resolved a:03valhallasw Jobs are being scheduled again.
[17:43:37] valhallasw`cloud: thx a lot!
[18:07:30] Kelson: yw! Sorry for the downtime :-)
[18:15:46] valhallasw`cloud: no problem
[18:21:52] I still can't create new DNS entries in labs: https://horizon.wikimedia.org/project/proxy/
[18:22:10] "Something went wrong!
[18:22:10] An unexpected error has occurred. Try refreshing the page. If that doesn't help, contact your local administrator."
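A sketch of the stuck-job workflow valhallasw`cloud describes above (14:39-14:42), assuming the usual gridengine tooling on Tools; the job id and host come from the log, while the FQDN and the tool account name are assumptions for illustration:

```
# The normal route: ask the scheduler to delete the job.
qdel 6668316

# If qdel then reports "job 6668316 is already in deletion" while the process
# keeps running, the exec host has likely stopped talking to qmaster. Force
# qmaster to forget the job (requires SGE manager rights)...
qdel -f 6668316

# ...and kill the orphaned process by hand on the exec host.
ssh tools-exec-1207.eqiad.wmflabs             # FQDN assumed; the log truncates it
pid=$(pgrep -u tools.example -f wp10-select)  # 'tools.example' is a hypothetical tool account
kill "$pid"
```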
[18:22:22] do you have an ETA for when this will be solved?
[18:23:35] Amir1: please file a bug in the #labs project
[18:23:53] I thought it was a known issue
[18:24:03] sure
[18:27:06] 06Labs, 10Horizon: DNS dashboard in horizon is broken - https://phabricator.wikimedia.org/T136489#2336823 (10Ladsgroup)
[18:40:26] Amir1: which project, which proxy, ... etc?
[18:41:29] Amir1: even if it's not specific to one project, having all the details makes reproducing issues much easier
[18:50:08] valhallasw`cloud: I can't even get to the dashboard for changing the DNS
[18:50:54] so, basically no proxy. And about the project: as I mentioned, it was "ores-staging"
[19:01:04] novaproxy-01 is a mess.
[19:07:45] Amir1, okay
[19:08:05] nope, different error now
[19:13:57] thanks Krenair :)
[19:21:18] Amir1, I think I've dealt with it.
[19:21:41] awesome
[19:21:44] let me check
[19:21:57] Krenair: yup, it's fixed
[19:23:26] 06Labs, 10Horizon: DNS dashboard in horizon is broken - https://phabricator.wikimedia.org/T136489#2336881 (10AlexMonk-WMF) a:03AlexMonk-WMF novaproxy-01 is a huge mess (stopped listening on :5668 for some reason, also two invisible-unicorn services and two different installations) and I hate uwsgi.
[19:27:03] 06Labs, 10Horizon: DNS dashboard in horizon is broken - https://phabricator.wikimedia.org/T136489#2336886 (10AlexMonk-WMF) 05Open>03Resolved To force the puppetised version to use python2 to load I had to move the python3 plugins out of /usr/lib/uwsgi/plugins/ into a 'disabled' subdirectory. There might h...
[19:28:32] 06Labs, 10Labs-Infrastructure: Clean up novaproxy-01 - https://phabricator.wikimedia.org/T136492#2336888 (10AlexMonk-WMF)
[19:30:36] Krenair: I'm not too familiar with the DNS system in labs, but I have some experience with uwsgi, so I might be able to help fix the issue. Can you elaborate more (if possible) or direct me to the source code?
[19:31:03] The problem is not with invisible-unicorn itself
[19:32:17] okay cool
[19:36:32] Basically we have a horizon plugin (and formerly, OpenStackManager MW extension code) that connects to a dynamicproxy-api (aka invisible-unicorn) service running on the host specified in the labs openstack service/endpoint list
[19:37:57] This host is currently a labs instance known as novaproxy-01 in the project 'project-proxy'. Given the abuse people could theoretically do with access to this, it's restricted, NDA required etc.
[19:39:19] It runs dynamicproxy-api, but has two copies installed, probably due to https://gerrit.wikimedia.org/r/#/c/251176/
[19:39:57] the one it had running broke recently (today? not sure) and I've been meaning to make it use the puppetised one anyway, which is now done.
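The uwsgi/python mismatch Krenair goes on to describe can be narrowed down with a few generic commands; a diagnostic sketch, assuming a Debian/Ubuntu host laid out like the one discussed (nothing here is specific to novaproxy-01 beyond the plugin path quoted in the 19:27 ticket comment above):

```
# Is Flask importable under each interpreter? Per the discussion it was only
# installed for python2 at the time.
python2 -c 'import flask; print(flask.__version__)'
python3 -c 'import flask; print(flask.__version__)'   # expected to fail here

# Which uwsgi workers are running, and from which installation?
ps aux | grep '[u]wsgi'

# The loadable language plugins; moving the python3 ones into a 'disabled'
# subdirectory (as in the ticket comment above) forces the python2 plugin.
ls /usr/lib/uwsgi/plugins/
```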
[19:40:25] For some reason uwsgi read the ini file and decided that, despite the 'plugins' line, it was going to use python3 anyway, which isn't compatible with the current code
[19:42:03] I guess we could update it to python3 code and then uwsgi would happen to choose the correct version
[19:42:37] but it'd require flask installed for python3 (it's only installed for python2 right now) and probably various other things, so I decided to keep it simple for now
[19:45:11] Krenair: thanks for the help, let me check the puppet modules for dynamicproxy
[19:52:25] I have an extra instance for dynamicproxy-api in the 'openstack' project that performs the same role for the labtest cluster (instead of labs)
[20:07:24] 06Labs, 10Tool-Labs: toolsbeta grid misconfigured - https://phabricator.wikimedia.org/T136433#2336917 (10valhallasw) p:05Triage>03High If I read the puppet manifests correctly, this configuration (of hostgroups and queues) should happen automatically. If that's not the case, that has a large impact on our...
[20:43:49] 10Quarry: show time of execution in quarry - https://phabricator.wikimedia.org/T136266#2336927 (10matej_suchanek)
[20:43:53] 10Quarry, 07Easy: Display time taken to execute a query - https://phabricator.wikimedia.org/T135189#2336928 (10matej_suchanek)
[20:43:57] 10Quarry: Include query execution time - https://phabricator.wikimedia.org/T126888#2336930 (10matej_suchanek)
[20:44:21] 10Quarry: Show the execution time in the table of queries - https://phabricator.wikimedia.org/T71264#2336931 (10matej_suchanek)
[20:44:25] 10Quarry: Include query execution time - https://phabricator.wikimedia.org/T126888#2025894 (10matej_suchanek)
[20:44:52] 10Quarry: Show the execution time in the table of queries - https://phabricator.wikimedia.org/T71264#718760 (10matej_suchanek)
[20:52:56] 06Labs, 10Labs-Infrastructure: Clean up novaproxy-01 - https://phabricator.wikimedia.org/T136492#2336888 (10Ladsgroup) The most robust way to solve this IMO is to first migrate to python3 (which is pretty simple) and then migrate to a venv. Then we would know for sure which version of python is being used.
[20:58:09] 06Labs, 10Tool-Labs: hhvm downgrade breaks puppet on tools-bastion-02 - https://phabricator.wikimedia.org/T136494#2336959 (10valhallasw)
[20:58:43] 06Labs, 10Tool-Labs: Stale NFS handle breaks puppet on tools-exec-1215 - https://phabricator.wikimedia.org/T136495#2336972 (10valhallasw)
[20:59:15] 06Labs, 10Tool-Labs: Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218 - https://phabricator.wikimedia.org/T136495#2336985 (10valhallasw)
[21:00:40] 06Labs, 10Tool-Labs: puppet failing on tools-k8s-bastion-01 (kubeconfig lock files) - https://phabricator.wikimedia.org/T136496#2336986 (10valhallasw)
[21:02:34] 06Labs, 10Tool-Labs: ssh not responding on tools-pastion-01 - https://phabricator.wikimedia.org/T136497#2337003 (10valhallasw)
[21:03:44] 06Labs, 10Tool-Labs: puppet disabled on tools-prometheus-01 - https://phabricator.wikimedia.org/T136498#2337016 (10valhallasw)
[21:04:54] 06Labs, 10Tool-Labs: Puppet failing on tools-webgrid-generic-1405 - https://phabricator.wikimedia.org/T136499#2337029 (10valhallasw)
[21:05:47] 06Labs, 10Tool-Labs: ssh closes connection on tools-webgrid-lighttpd-1408 - https://phabricator.wikimedia.org/T136500#2337042 (10valhallasw)
[21:08:14] 06Labs, 10Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#2337056 (10valhallasw) There's a large number of failures again.
Most of them seem to be `service check timed out`s, probably caused by {T127957}.
[21:15:42] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[21:16:29] that's me.
[21:16:48] (running puppet manually, so the cron one fails)
[21:20:38] 06Labs, 10Tool-Labs: Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218 - https://phabricator.wikimedia.org/T136495#2337062 (10valhallasw) Disabled queues on the three hosts, rescheduled continuous jobs, waiting for the rest to drain: ``` HOSTNAME ARCH NCPU LOAD MEMTOT...
[21:21:15] !log tools rebooting tools-exec-1204 (T136495)
[21:21:16] T136495: Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218 - https://phabricator.wikimedia.org/T136495
[21:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:23:16] 06Labs, 10Tool-Labs: Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218 - https://phabricator.wikimedia.org/T136495#2337066 (10valhallasw) tools-exec-1204 is back up and running; the other two hosts still need to drain before they can be rebooted.
[21:25:40] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:30:05] hi :). If I want to add a security rule for ssh at wikitech, which protocol do I have to choose? icmp, tcp, udp? Or is that only possible via horizon?
[21:36:23] chasemp: I wonder if we can't change which security rule belongs to which instance after creation. It currently works at horizon ;)
[21:37:02] Luke081515: tcp port 22
[21:37:13] ok, thx
[21:37:18] I think that should be open in the default security group?
[21:37:48] Luke081515: and I don't get your second comment. Yes, it works in Horizon and is not supported on Wikitech. So use Horizon?
[21:38:31] valhallasw`cloud: I just mentioned that because IIRC chasemp said last time that it is not possible ;)
[21:42:33] 06Labs, 10Tool-Labs: ssh closes connection on tools-webgrid-lighttpd-1408 - https://phabricator.wikimedia.org/T136500#2337077 (10valhallasw) All jobs are in `deleted` state but still running there. I'm just going to reboot the host; webservicewatcher will bring the jobs back online.
[21:45:26] 06Labs, 10Tool-Labs: ssh closes connection on tools-webgrid-lighttpd-1408 - https://phabricator.wikimedia.org/T136500#2337079 (10valhallasw) 05Open>03Resolved a:03valhallasw Host is back online.
[21:45:52] 06Labs, 10Tool-Labs: ssh not responding on tools-pastion-01 - https://phabricator.wikimedia.org/T136497#2337082 (10valhallasw) 05Open>03Resolved a:03valhallasw Host rebooted and back online.
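A sketch of the drain-and-reboot flow from the T136495 comments above, assuming SGE admin rights on the grid master; `qmod` and `qstat` are standard gridengine tools, and the queue-instance pattern is illustrative:

```
# Disable every queue instance on the affected host so nothing new lands there
# (continuous jobs can be rescheduled elsewhere, e.g. with qmod -r).
qmod -d '*@tools-exec-1204'

# Watch the host's queues drain.
qstat -f -q '*@tools-exec-1204'

# After the reboot, re-enable the queues.
qmod -e '*@tools-exec-1204'
```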
[21:46:04] RECOVERY - SSH on tools-webgrid-lighttpd-1408 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[21:48:18] RECOVERY - SSH on tools-pastion-01 is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[21:50:17] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0]
[21:50:51] PROBLEM - Puppet run on tools-pastion-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[21:51:07] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[21:51:59] !log tools rebooted tools-webgrid-lighttpd-1408, tools-pastion-01, tools-exec-1205
[21:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[21:52:32] 06Labs, 10Tool-Labs: Puppet failing on tools-webgrid-generic-1405 - https://phabricator.wikimedia.org/T136499#2337086 (10valhallasw) 05Open>03Resolved a:03valhallasw Disks look OK. I ran ``` valhallasw@tools-webgrid-generic-1405:~$ sudo dpkg --configure -a ``` which didn't solve anything. ``` sudo apt...
[21:53:49] PROBLEM - Puppet staleness on tools-pastion-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[21:54:31] 06Labs, 10Tool-Labs: Stale NFS handle breaks puppet on tools-exec-1204, -1205 and -1218 - https://phabricator.wikimedia.org/T136495#2337089 (10valhallasw) tools-exec-1205 has also been rebooted and reenabled.
[22:00:10] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [3600.0]
[22:01:24] RECOVERY - Puppet run on tools-webgrid-generic-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:14:14] !log wikilabels deploying 6065701 to prod
[22:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master
[22:25:12] !log wikilabels deploy d13dd3c into prod
[22:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikilabels/SAL, Master
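On the truncated `sudo apt...` in the 21:52 ticket comment above: the actual command is cut off in the log and stays unknown, but a typical recovery sequence for half-configured packages looks like the following (a generic sketch, not a reconstruction of what was actually run):

```
# Finish configuring any packages left half-installed.
sudo dpkg --configure -a

# Try to repair broken or unmet dependencies.
sudo apt-get -f install

# One-shot verbose puppet run to confirm the failure is gone.
sudo puppet agent -t
```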