[00:10:43] (CR) Jean-Frédéric: "* The toolbox all in all was a worthy attempt − to get nice, unified UI − but it never really took off. I actually think our best chance t" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/303933 (owner: EdouardHue)
[00:25:13] PROBLEM - Host tools-worker-1024 is DOWN: CRITICAL - Host Unreachable (10.68.22.168)
[00:26:06] ^ is me, is ok
[00:53:45] (CR) Legoktm: [C: 032] -releng: remove #browser-tests, fix -infra [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/304041 (owner: Greg Grossmeier)
[01:03:12] (Merged) jenkins-bot: -releng: remove #browser-tests, fix -infra [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/304041 (owner: Greg Grossmeier)
[01:23:33] Labs: pagecount-dumps at /public/dumps/pagecounts-* has stopped - https://phabricator.wikimedia.org/T142671#2542679 (Stigmj)
[03:24:58] Labs-project-other: Successful pilot of Discourse on https://discourse.wmflabs.org/ as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#2542742 (Tgr) Convenience link: https://lists.wikimedia.org/mailman/listinfo/discourse
[03:38:01] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[03:38:13] PROBLEM - Puppet run on tools-exec-1407 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[03:38:21] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[03:38:25] PROBLEM - Puppet run on tools-exec-1210 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[03:39:37] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[03:42:02] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[03:43:12] PROBLEM - Puppet run on tools-webgrid-generic-1405 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[03:43:14] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[03:43:22] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[03:43:26] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:43:38] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:43:52] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[03:44:00] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[03:44:34] This is all me, fixing
[03:44:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:44:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[03:44:50] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[03:46:10] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[03:47:42] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[03:47:52] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:47:53] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[03:47:53] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:48:15] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:48:15] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:48:25] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[03:48:37] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[03:49:48] they all gonna recover
[03:50:06] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[03:50:27] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[03:53:57] PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[03:58:00] RECOVERY - Puppet run on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[03:59:36] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:17:03] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:18:13] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:18:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:18:26] RECOVERY - Puppet run on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:18:27] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:18:36] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:18:53] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:19:39] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:19:51] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:21:07] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:22:41] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:22:53] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:22:53] RECOVERY - Puppet run on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:22:53] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:11] RECOVERY - Puppet run on tools-webgrid-generic-1405 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:11] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:15] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:15] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:21] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:35] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:23:58] RECOVERY - Puppet run on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:01] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:24:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:25:27] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:30:03] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:39:11] PROBLEM - Puppet run on tools-prometheus-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[07:02:19] Labs, Tool-Labs: Maintainers are not shown in the Tools list - https://phabricator.wikimedia.org/T142684#2542979 (4nn1l2)
[07:10:12] RECOVERY - Puppet staleness on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [3600.0]
[09:32:07] (PS1) Giuseppe Lavagetto: adding puppetdb credentials stub [labs/private] - https://gerrit.wikimedia.org/r/304191
[09:32:35] (CR) Giuseppe Lavagetto: [C: 032 V: 032] adding puppetdb credentials stub [labs/private] - https://gerrit.wikimedia.org/r/304191 (owner: Giuseppe Lavagetto)
[09:45:00] hi (sorry, newbie question): when I want to add a new instance, in Instance type I see "0 GB storage"
[09:45:21] does it mean I ran out of quota in my project or is it "normal"?
[09:45:46] note that I've just deleted 4 m1.wlarge instances in my project
[12:29:10] Labs, Dumps-Generation: pagecount-dumps at /public/dumps/pagecounts-* has stopped - https://phabricator.wikimedia.org/T142671#2543735 (valhallasw)
[13:04:21] Labs, Dumps-Generation: pagecount-dumps at /public/dumps/pagecounts-* has stopped - https://phabricator.wikimedia.org/T142671#2542679 (Ottomata) See this announcement on the Analytics public mailing list: https://lists.wikimedia.org/pipermail/analytics/2016-August/005339.html
[13:24:15] dcausse: that would be odd as storage isn't directly quota'd
[13:24:29] where do you see that / what are you doing to see that?
[13:24:54] chasemp: it was when creating a new instance, in the drop down menu when you choose the vm size
[13:25:12] dcausse: screenshot?
[13:25:18] chasemp: sure
[13:25:50] dcausse: we recently added some new flavors I think and maybe this is a side effect
[13:26:22] chasemp: in the end I had no problem, the vm is working fine
[13:26:29] ah ok
[13:27:04] still interested, if you don't mind, in a screenshot and seeing if it's the same now
[13:28:36] chasemp: https://phabricator.wikimedia.org/F4353571
[13:28:57] dcausse: ah, I think this is for attached storage which we don't do atm
[13:29:10] ah ok, sorry then :)
[13:34:40] no worries, thanks for asking
[14:22:45] chasemp: any way I can strace a job of mine on grid?
[14:23:00] (tool labs)
[14:24:14] or yuvipanda
[14:33:03] zhuyifei1999_: the simplest thing is to run it on the bastion to debug briefly, we restrict some resource stuff there and actively discourage long running procs, but that is what I do
[14:33:20] there is a way to reach out to a particular exec I believe but I can't recall specifics of the setup
[14:34:12] well, I have two jobs that usually work, but are failing right now.
[14:34:40] restarting on bastion cannot reproduce whatever is going on
[14:35:14] valhallasw`cloud: do you recall the details on host to host auth so zhuyifei1999_ could reach and strace that job where it is running?^
[14:35:32] I can ssh in, but /proc/sys/kernel/yama/ptrace_scope is 1 :(
[14:36:41] asking whether stracing own jobs should be / is supported
[14:37:23] I'm not sure historically what the answer there is, have to run here unfortunately, can you make a task and ping me and yuvi? we'll figure something out
[14:38:11] ok
[14:44:42] zhuyifei1999_: it's off by default on multi-user systems because it's a security risk, but I don't know the details
[14:47:07] yeah ik
[14:48:05] it's meant to prevent hackers from reading data from processes owned by the account the hacker hacked into
[14:53:36] Labs, Tool-Labs: possibility to strace one's own jobs on tool labs - https://phabricator.wikimedia.org/T142715#2544106 (zhuyifei1999)
[14:56:17] Labs, Tool-Labs: allow tool users to attach strace to their processes (at least on exec hosts) - https://phabricator.wikimedia.org/T114401#2544137 (valhallasw)
[14:56:19] Labs, Tool-Labs: possibility to strace one's own jobs on tool labs - https://phabricator.wikimedia.org/T142715#2544139 (valhallasw)
[14:57:31] valhallasw`cloud: can you strace the process mentioned in the ticket?
[14:57:35] sure
[14:58:13] thx
[14:58:28] select(0, NULL, NULL, NULL, {0, 244841}) = 0 (Timeout)
[14:58:28] select(0, NULL, NULL, NULL, {0, 250000}) = 0 (Timeout)
[14:58:29] select(0, NULL, NULL, NULL, {0, 250000}) = 0 (Timeout)
[14:58:29] ad infinitum
[14:58:40] let me see if I can figure out how to get a python stacktrace from gdb
[14:58:43] it has a zombie child process, for idk-reason
[15:01:15] 'unable to read debugging information'
[15:01:15] bah.
[15:05:43] zhuyifei1999_: so ctrl-C might be the most informative option after all
[15:06:03] does python sleep via selects? o.O
[15:06:22] https://www.irccloud.com/pastebin/K78cDDXZ/
[15:06:46] #python ;-)
[15:07:23] yeah, killed another job, found it stuck in time.sleep
[15:07:27] Labs, Labs-Infrastructure, DBA: labsdb* has no automatic failover solution - https://phabricator.wikimedia.org/T141097#2544182 (jcrespo)
[15:08:59] whatever, SIGINT-ed
[15:11:58] zhuyifei1999_: stuck in time.sleep() or just in a sleep loop?
[15:13:29] valhallasw`cloud: I'm using a pywikibot from march. in master the line corresponds to https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/site.py#L1061
[15:13:57] *March
[15:14:20] zhuyifei1999_: sounds like some sort of a deadlock
[15:14:31] yeah
[15:14:38] are you doing anything multithreaded?
[15:15:35] https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/site.py#L4839 < that doesn't look quite right
[15:15:48] there's a whole lot of things that can go wrong between the lock and the unlock
[15:16:09] the /data/project/yifeibot/pywikibot/com_info_parserror.py is not multithreaded
[15:31:54] Labs, Labs-Infrastructure: Track labs instances hanging - https://phabricator.wikimedia.org/T141673#2507274 (AlexMonk-WMF) deployment-cache-upload04, still successfully running upload.beta.wmflabs.org on HTTP/HTTPS and responding to pings,
[15:49:19] Labs, Tool-Labs, WLX-Jury, Patch-For-Review, Wiki-Loves-Monuments (2016): Figure out a way to support java 1.8 on tool labs (For WLX Jury) - https://phabricator.wikimedia.org/T124903#2544331 (LilyOfTheWest) @intracer if you have tried the steps in https://wikitech.wikimedia.org/wiki/User:Sn1p...
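Editor's aside: the Yama setting blocking the strace attempt above can be checked programmatically. This is a hedged sketch, not tooling from the log; the /proc path is the one quoted in the conversation, everything else is illustrative. Scope 0 allows attaching to any same-uid process, scope 1 (the value on the exec hosts) restricts ptrace to direct children.

```python
# Hedged sketch: read the Yama ptrace_scope discussed in the log.
# 0 = classic ptrace (strace -p works on any same-uid process),
# 1 = restricted (attach only to direct children), 2/3 = stricter.
from pathlib import Path


def ptrace_scope(path="/proc/sys/kernel/yama/ptrace_scope"):
    """Return the Yama ptrace_scope as an int, or None if unavailable
    (non-Linux, or a kernel built without the Yama LSM)."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    scope = ptrace_scope()
    if scope is None:
        print("Yama not available on this system")
    elif scope >= 1:
        print(f"ptrace_scope={scope}: strace -p on a sibling process will fail")
    else:
        print("ptrace_scope=0: strace -p allowed for same-uid processes")
```
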
[16:37:22] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[16:38:10] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[16:38:38] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[16:42:00] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218)
[16:43:18] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22)
[16:45:30] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[17:00:18] RECOVERY - Puppet staleness on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:22:07] !log tools reboot via nova master as it is stuck
[17:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[17:23:12] chasemp you forgot to say what was actually being rebooted ;)
[17:23:22] !log tools instance being rebooted is tools-grid-master
[17:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[17:23:45] reboot master via nova as it is stuck? :)
[17:45:27] truth is I forgot the word reboot and then tried to insert it and it came out like yoda
[17:46:02] hehe :D
[17:46:10] chasemp did you find anything there?
[17:48:00] well can you get to tools-grid-master?
[17:48:02] yuvipanda: ^
[17:48:17] chasemp oh I thought you rebooted it?
[17:48:17] let me try
[17:48:22] I did
[17:48:27] oh
[17:48:29] that's the scary part
[17:48:30] wtf
[17:48:43] ok I can still ping it
[17:48:45] still says rebooting
[17:48:52] may have never transitioned
[17:49:17] I'm not convinced this is the same issue yet, we had this happened once before and it was general grid master weirdness and nfs issues iirc
[17:49:20] but I don't know really
[17:49:23] but this is odd as hell
[17:49:54] chasemp yah.
[17:51:07] chasemp so gridengine still works
[17:51:21] it never went down
[17:51:23] must be but why
[17:51:25] right
[17:51:30] | 64f01f90-c805-4a2e-9ed5-f523b909094e | tools-grid-master | tools | REBOOT | rebooting | Running | public=10.68.20.158
[17:51:38] from
[17:51:38] | 64f01f90-c805-4a2e-9ed5-f523b909094e | tools-grid-master | tools | ACTIVE | - | Running | public=10.68.20.158
[17:51:41] so it definitely took
[17:51:42] yeah
[17:51:49] so one hypothesis is
[17:51:56] that nova's lagging intensely in some ways
[17:52:11] generalizing a bunch of stuff from the last few days
[17:52:23] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:52:37] (I'm in the meantime killing stale puppet alerts)
[17:52:55] RECOVERY - Puppet staleness on tools-exec-1211 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:57:00] RECOVERY - Puppet staleness on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [3600.0]
[17:57:03] Labs-project-Wikistats: wikistats (labs project): remove test wikipedias from wikipedia table - https://phabricator.wikimedia.org/T142730#2544799 (Dzahn)
[17:57:26] Labs-project-Wikistats: wikistats (labs project): remove test wikipedias from wikipedia table - https://phabricator.wikimedia.org/T142730#2544814 (Dzahn) p: Triage>Normal
[17:58:17] Labs-project-Wikistats: wikistats (labs project): remove test wikipedias from wikipedia table - https://phabricator.wikimedia.org/T142730#2544799 (Dzahn) this happened in T140970 recently
[17:59:41] chasemp should we do the 'reset state' thing and try rebooting?
[17:59:43] Labs-project-Wikistats: wikistats (labs project): remove test wikipedias from wikipedia table - https://phabricator.wikimedia.org/T142730#2544833 (Dzahn) https://phabricator.wikimedia.org/T140970#2534684 ``` I checked the diff between the list of wikipedia prefixes i get from curl https://meta.wikimedia....
[17:59:51] yeah I think so, I think this is a nova issue
[17:59:57] and it actually happened yesterday too
[18:00:48] chasemp with CI?
[18:01:15] phlogiston instance did the same exact thing
[18:02:12] RECOVERY - Puppet staleness on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [3600.0]
[18:02:23] yuvipanda: happen to know that command?
[18:02:40] Labs-project-Wikistats: wikistats (labs project): rename Image column to Files - https://phabricator.wikimedia.org/T142732#2544838 (Dzahn)
[18:02:42] chasemp nope, andrewbogott said it yesterday, trying to find
[18:03:00] Labs-project-Wikistats: wikistats (labs project): rename Image column to Files - https://phabricator.wikimedia.org/T142732#2544851 (Dzahn) p: Triage>Normal
[18:04:39] chasemp nova reset-state —active uuid
[18:04:50] yep
[18:05:18] fwiw two dashes before active (didn't translate well)
[18:06:32] chasemp ah, right.
[18:06:51] for posterity because we'll have to find it again in irc logs :)
[18:07:01] hehe
[18:07:02] should write it down somewhere
[18:07:19] chasemp I'm writing it down in https://wikitech.wikimedia.org/wiki/OpenStack
[18:07:20] this message changed in liberty I think
[18:07:20] Request to reboot server has been accepted.
[18:07:30] never used to give you the human name
[18:09:09] t down)
[18:09:18] (I wrote i
[18:09:18] lol
[18:09:19] ok
[18:12:56] Labs-project-Wikistats: wikistats: add tcy.wikipedia, jam.wikipedia (and check for other missing ones) - https://phabricator.wikimedia.org/T140970#2544886 (Dzahn) >>! In T140970#2534694, @Krenair wrote: > Should this be added to the Add_a_wiki docs? https://wikitech.wikimedia.org/w/index.php?title=Add_a_wik...
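Editor's aside: for posterity, the recovery recipe discussed above, written out as a command sketch. This is assembled from the commands quoted in the log, not a verbatim runbook; the UUID is the tools-grid-master one from the log, and the hard-reboot flag follows the standard nova CLI (`nova reboot --hard`, which is likely what "nova restart --hard" later in the log refers to).

```shell
# Sketch (runs only on a labs control host with nova credentials):
# clear a stuck REBOOT/rebooting task state, then retry the reboot.
# Note: two dashes before "active".
nova reset-state --active 64f01f90-c805-4a2e-9ed5-f523b909094e
nova reboot --hard 64f01f90-c805-4a2e-9ed5-f523b909094e
nova show 64f01f90-c805-4a2e-9ed5-f523b909094e   # watch the status/task fields
```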
[18:20:08] Labs, Tool-Labs: redis diamond collector not running/reporting on tools-redis-1001/2 - https://phabricator.wikimedia.org/T142735#2544920 (valhallasw)
[18:20:15] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [3600.0]
[18:46:39] Labs: (Re-)Create Gitblit->Phabricator testing instance on Labs - https://phabricator.wikimedia.org/T142186#2544989 (Dzahn) stalled>Open https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Gitblit/Documentation&oldid=817090
[18:46:41] Labs, Labs-Infrastructure: Creating new instance failed - https://phabricator.wikimedia.org/T136656#2544991 (Dzahn)
[18:58:11] Labs: (Re-)Create Gitblit->Phabricator testing instance on Labs - https://phabricator.wikimedia.org/T142186#2545015 (Dzahn) @Danny_B Here you go. The project name is "gitblit", i made it just for this purpose and added you as contact. I created an instance called "danny". You are a project admin. You can...
[18:58:54] Labs, Tool-Labs: redis diamond collector not running/reporting on tools-redis-1001/2 - https://phabricator.wikimedia.org/T142735#2545019 (valhallasw) I think this is due to the move from tools-redis -> redis::legacy -> redis::instance: * {rOPUPad641da2f4a2f08e501187b13805f5b19a3cf748} * {rOPUP0cb76d7db8...
[19:00:15] Labs: (Re-)Create Gitblit->Phabricator testing instance on Labs - https://phabricator.wikimedia.org/T142186#2545035 (Danny_B) Thank you very much. It will help me to speed up resolving of some #gitblit-deprecate tasks.
[19:02:32] yuvipanda: could you check'n'+2 https://gerrit.wikimedia.org/r/304295 ?
[19:02:44] ugh there's a , missing
[19:03:45] valhallasw`cloud sure! Waiting for jenkins...
[19:05:29] jenkins likes me today!
[19:05:52] valhallasw`cloud also, thoughts on https://phabricator.wikimedia.org/T142452?
[19:06:02] yuvipanda: +2
[19:06:13] valhallasw`cloud can you put it in the task? :D
[19:07:10] Labs, Tool-Labs: Move all of tool labs to project puppetmaster - https://phabricator.wikimedia.org/T142452#2545062 (valhallasw) I prefer this solution over moving everything to the central puppetmaster. One very clear advantage is making the puppetmaster accessible to non-ops admins, which could help bot...
[19:11:03] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:11:18] well, I think I broke something
[19:11:31] no, that's just me running a manual run and aborting
[19:11:51] valhallasw`cloud because gerrit AKJHFOKDHGAOLHDG I didn't hit 'submit'
[19:11:52] just did
[19:12:01] ah. no auto-submit?
[19:12:20] valhallasw`cloud no.
[19:13:24] valhallasw`cloud chasemp I'm going to move puppetmasters sometime next week then.
[19:13:36] and do it slowly so we don't overload it
[19:13:44] for all of tools you mean?
[19:14:04] yeah
[19:14:10] blegh puppet
[19:14:33] Error: Failed to apply catalog: Could not find dependency Class[::Redis::Instance] for Diamond::Collector[Redis] at /etc/puppet/modules/toollabs/manifests/redis.pp:70
[19:14:42] ...so then how /am/ I supposed to set that dependency >_<
[19:15:35] valhallasw`cloud: if you include the manifest where that is defined
[19:15:50] you can reference it for dep? but not sure what exactly you are running up against
[19:16:17] class toollabs::redis { include ::redis::client::python ; redis::instance { ... }; diamond::collector { 'Redis': require => Class['::redis::instance'] }
[19:16:23] chasemp btw, i bumped log collection retention to 14d since we have the capacity, and it's also running in two hosts for redundancy (tools-logs-01 and -02)
[19:16:40] so that should require ::redis::client::python instead? not sure how the ordering works then
[19:17:48] yuvipanda: how does it sync between them? or does it
[19:18:10] chasemp nope, all instances just send it to both places
[19:18:14] valhallasw`cloud: can't follow that well, changeset link or file?
[19:18:18] yuvipanda: ah ok simple
[19:18:21] yeah
[19:18:26] chasemp same for prometheus
[19:18:37] this is basically how I'm going to deal with random instances freezing :P
[19:18:47] chasemp: https://gerrit.wikimedia.org/r/304295 , https://gerrit.wikimedia.org/r/304298 ; I think the latter is probably the correct one
[19:19:08] because redis::instance is actually a define and not a class
[19:19:37] (and the point is that it requires the python-redis package, not so much the server itself)
[19:20:16] valhallasw`cloud: I would write that second one require => Class['Redis::Client::Python'],
[19:20:16] I think?
[19:20:53] is that different? this is how it's included at the top
[19:20:56] * valhallasw`cloud doesn't get puppet logic
[19:21:05] yeah the caps matter here as in
[19:21:13] require => File['foo'] vs
[19:21:18] require file['foo']
[19:21:30] and you don't need the leading '::' because they are both refd from inside of modules
[19:22:13] and iirc the caps translate through layers so Redis::Client
[19:22:33] I'm going to go to BART to go to office
[19:23:32] PROBLEM - Puppet run on tools-redis-1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[19:23:34] chasemp: mm, so I think this is how it was originally: https://phabricator.wikimedia.org/rOPUP0cb76d7db83fc92fcdd1a07a39933c490f9dd48d
[19:24:06] huh, yeah I'm surprised that works I guess but not shocked
[19:24:56] so that's https://gerrit.wikimedia.org/r/#/c/304298/. I'll also prepare one with capitals for when that also fails >_<
[19:28:15] valhallasw`cloud: is this fixing a dep cycle issue now?
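Editor's aside: the fix the conversation above converges on can be sketched in Puppet. This is a hedged reconstruction, not the merged patch; the class and define names are the ones quoted in the log, and the instance parameters are elided because the log does not show them. The point is that `redis::instance` is a define, so `Class['::redis::instance']` can never resolve; the collector's real dependency is the class that installs the python-redis bindings.

```puppet
# Hedged sketch of the dependency fix discussed above (names from the
# log; parameters illustrative). Resource references use a capitalized
# type name per namespace segment, e.g. Class['redis::client::python'].
class toollabs::redis {
    include ::redis::client::python

    redis::instance { '6379':
        # instance settings elided in the log
    }

    diamond::collector { 'Redis':
        # Depend on the class actually included above, not on the
        # redis::instance define (which is not a class).
        require => Class['redis::client::python'],
    }
}
```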
[19:29:06] chasemp: fixing a dependency (which is now breaking puppet on tools-redis-*)
[19:34:02] valhallasw`cloud: I'm trying to merge it and gerrit is giving me hell
[19:35:07] valhallasw`cloud: ok let me know if that works out
[19:35:10] thanks!
[19:35:18] * valhallasw`cloud runs puppet
[19:35:30] yep, seems to work :-)
[19:37:00] !log gitblit added Danny_B as project admin
[19:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Gitblit/SAL, Master
[19:37:48] !log gitblit created instance danny.gitblit.eqiad.wmflabs (to be configured)
[19:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Gitblit/SAL, Master
[19:46:02] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0]
[19:48:32] RECOVERY - Puppet run on tools-redis-1001 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:00:46] what is the proper component in phab to report a task regarding special:novainstance?
[20:01:46] Tried creating an instance and got "Failed to create instance.". Should I file a ticket?
[20:02:16] jaufrecht: we're having some issues with mutante as well. so i guess yes
[20:03:11] Matthew_: are you currently here?
[20:03:53] i created one successfully a little while ago. in our case it wasn't an issue when creating it, just ssh to it
[20:04:00] Matthew_: It seems like https://tools.wmflabs.org/xtools/rfa needs a restart
[20:04:11] Luke081515: Yes, here.
[20:04:16] Daggum it...
[20:05:16] !log tools.xtools Restarted webservice, rfa tool hanging
[20:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.xtools/SAL, Master
[20:05:24] Luke081515: {{done}}
[20:05:41] Matthew_: thank you very much :)
[20:05:55] !log tools disabled tools-webgrid-lighttpd-1202, is hung
[20:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:06:13] Labs-project-Wikistats: wikistats (labs project): remove test wikipedias from wikipedia table - https://phabricator.wikimedia.org/T142730#2545220 (Dzahn) Open>Resolved ``` MariaDB [wikistats]> delete from wikipedias where prefix="test"; Query OK, 1 row affected (0.00 sec) MariaDB [wikistats]> delete...
[20:06:55] who should I assign it to?
[20:08:55] Labs, Phlogiston (Interrupt): Error creating new instance - https://phabricator.wikimedia.org/T142742#2545239 (JAufrecht)
[20:11:22] Labs-project-Wikistats, Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2545252 (Dzahn) a: Dzahn
[20:13:26] !log tools tools-grid-master finally stopped
[20:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[20:13:49] chasemp I issued a shutdown command to it and it just waited forever and finally stopped
[20:14:08] (can confirm it stopped on the host as well)
[20:14:10] waiting for it to come back on now
[20:15:05] PROBLEM - Host tools-grid-master is DOWN: CRITICAL - Host Unreachable (10.68.20.158)
[20:16:48] yuvipanda: I had done the reset state prior I think
[20:17:04] yuvipanda: also 'nova restart --hard' is a thing fyi
[20:17:06] was going to try that
[20:17:11] MediaWiki-extensions-OpenStackManager: Treat or strip escape sequencies in console output - https://phabricator.wikimedia.org/T142744#2545273 (Danny_B)
[20:17:18] chasemp ah, I see. ok
[20:17:26] chasemp it's still in 'powering-on' state
[20:17:29] gonna give it a few more minutes
[20:21:21] yuvipanda: so jobs will be running but job submission etc is down atm and it's not up
[20:21:25] yeah
[20:21:26] Posted a long message: http://matrix.org/_matrix/media/v1/download/matrix.org/pDsnuJGFJdMHxWkSRiIRtVMt - 2016-08-11_20:21:25.txt
[20:21:53] nova log?
[20:22:21] nova-compute on labvirt1010
[20:22:49] chasemp I'm considering doing the following:
[20:22:51] 1. shut it down again
[20:22:54] 2. attempt to migrate it off
[20:23:07] on the assumption that this is an issue on labvirt1010
[20:23:09] but digging through more logs now
[20:23:22] RECOVERY - Host tools-grid-master is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[20:23:29] wahhhhh
[20:23:41] chasemp was that you?
[20:23:48] it's up now
[20:23:55] I can ssh in
[20:25:06] everything seems to be fine now?
[20:25:17] if you didn't explicitly do anything this vaguely lends credence to 'everything works, just in ultra super slow motion, and sometimes times out'
[20:25:40] RECOVERY - SSH on tools-grid-master is OK: SSH OK - OpenSSH_6.9p1 Ubuntu-2~trusty1 (protocol 2.0)
[20:25:44] I can't ssh in yet
[20:25:48] root@ prompts for pass?
[20:25:59] can you see auth.log? yuvipanda
[20:26:09] looking
[20:26:21] I didn't do anything i was looking up that message
[20:26:21] I found it, it's not errant, it doesn't surprise me if it moved on
[20:26:35] Aug 11 20:25:38 tools-grid-master sshd[3154]: pam_access(sshd:account): access denied for user `rush' from `bastion-restricted-01.bastion.eqiad.wmflabs'
[20:26:48] Aug 11 20:25:38 tools-grid-master sshd[3154]: Failed publickey for rush from 10.68.18.66 port 39284 ssh2: RSA SHA256:uZ+iUZs4ceuu+fHpXniYddlsTBbIk47BFfq6Up+Bnls
[20:26:50] that's interesting
[20:26:54] not sure what's going on there?
[20:26:58] I just got in as root
[20:27:12] can you run puppet?
[20:27:26] yeah running it now
[20:28:05] Not in the tools.admin service group?
[20:28:11] yuvipanda: it's not just that host for me
[20:28:19] seems also tools-bastion-03 so it's something else
[20:28:36] although I can get on there as rush
[20:29:04] hmm
[20:29:18] cn=tools.admin,ou=servicegroups,dc=wikimedia,dc=org doesn't like your account
[20:29:23] chasemp so something very interesting is happening
[20:29:24] though tools-bastion-03 should work regardless
[20:29:28] Labs-Kubernetes: Install Helm on Kubernetes - https://phabricator.wikimedia.org/T142743#2545326 (Danny_B)
[20:29:31] the gridengine master doesn't have the gridengine master process running
[20:29:31] doesn't list your account*
[20:30:01] yep that's true
[20:30:07] but...wut
[20:30:11] chasemp yeah, ^ would explain some other unrelated woes from earlier, maybe it got bumped out somehow?
[20:30:23] that makes no sense
[20:30:31] hahahahahahaha
[20:30:32] oh my god
[20:30:36] the failover actually worked!
[20:30:43] and the shadow took over!
[20:30:49] IT IS A MIRACLE!
[20:31:00] but yeah, the gridengine shadow is now the master and stuff seems alright
[20:31:10] root@tools-grid-master:~# cat /var/lib/gridengine/default/common/act_qmaster
[20:31:11] ug I like it less than it not working at all because that instance is a mystery to me
[20:31:14] weird man
[20:31:15] tools-grid-shadow.tools.eqiad.wmflabs
[20:32:01] https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin#GridEngine_Master has info on how that works
[20:32:06] but I've never had it actually work before
[20:32:09] UNTIL NOW
[20:32:13] anyway
[20:32:28] chasemp add yourself to admin and try sshing in? Not sure why your root key doesn't work tho
[20:33:12] chasemp also not sure if you saw, I found another instance that's hung like the jessie instances but is actually precise. (tools-webgrid-lighttpd-1202). I've depooled it just now
[20:33:33] yuvipanda: is empty on tools-bastion-03
[20:33:33] tools-bastion-03:/etc/ssh/userkeys
[20:33:35] that should have root defined in it?
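Editor's aside: the failover check performed above, as a command sketch. The path and hostnames are the ones printed in the log; `qstat` is standard gridengine tooling, and which exact status commands were run beyond the `cat` is an assumption.

```shell
# Sketch: confirm which host is the active gridengine qmaster after a
# shadow-master failover (runs only on a gridengine host).
cat /var/lib/gridengine/default/common/act_qmaster
# In the log this printed tools-grid-shadow.tools.eqiad.wmflabs,
# i.e. the shadow had taken over from tools-grid-master.
qstat -f | head   # sanity check: the scheduler still answers queries
```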
[20:33:45] uh
[20:33:46] yes
[20:34:05] wat
[20:34:07] it's empty on grid-master too
[20:34:12] then how the hell did I ssh in?!
[20:34:25] yeah how did you get on it?
[20:34:31] I just got in as root
[20:34:59] RECOVERY - Puppet staleness on tools-grid-master is OK: OK: Less than 1.00% above the threshold [3600.0]
[20:39:10] 10Labs-project-Wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2545343 (10Dzahn) ``` MariaDB [wikistats]> delete from mediawikis where statsurl like "%wikibusiness.org%"; Query OK, 1 row affected (0.04 sec) MariaDB [wikistats]> delete from medi...
[20:42:12] 10Labs-project-Wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2545364 (10Dzahn) What would be super helpful for this kind of ticket is if you could refer to the wikis by their ID. That is in the column ID on the far right in the table, after t...
[20:44:53] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[20:44:56] 10Labs-project-Wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2545376 (10Dzahn) Also, for this specific tag, you can feel free to just assign them to me directly to get a faster response next time. I used to have a Herald rule here to make this...
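Dzahn's phabricator comments above make a practical point: the wikistats rows were deleted with `LIKE "%...%"` pattern matches on the URL, and he asks that wikis be referred to by their ID instead, which makes deletes unambiguous. A sketch of the ID-based, parameterized variant; sqlite3 stands in for the MariaDB `wikistats` database, and the two-column schema is illustrative only:

```python
import sqlite3

# In-memory stand-in for the wikistats `mediawikis` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mediawikis (id INTEGER PRIMARY KEY, statsurl TEXT)")
conn.executemany(
    "INSERT INTO mediawikis VALUES (?, ?)",
    [(1, "http://wikibusiness.org/Special:Statistics"),
     (2, "https://example.org/w/api.php")],
)

# Deleting by primary key (parameterized) hits exactly one known row,
# unlike LIKE '%wikibusiness.org%', which can match unrelated URLs.
conn.execute("DELETE FROM mediawikis WHERE id = ?", (1,))
remaining = [row[0] for row in conn.execute("SELECT id FROM mediawikis ORDER BY id")]
```

Parameterized queries also avoid quoting problems when URLs contain `%` or `_`, which are wildcards in a `LIKE` pattern.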
[20:46:21] (03PS1) 10Yuvipanda: Don't attempt to set root user password [labs/private] - 10https://gerrit.wikimedia.org/r/304321 [20:46:50] (03PS2) 10Yuvipanda: Don't attempt to set root user password [labs/private] - 10https://gerrit.wikimedia.org/r/304321 [20:55:18] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [43200.0] [21:00:19] RECOVERY - Puppet staleness on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [3600.0] [21:04:04] 10Labs-project-Wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2545500 (10Dzahn) - https://www.wikiocity.com/api.php - fixed by changing http to https in URL and running update - also done: added Random Wikisaur https://wikistats.wmflabs.org/... [21:12:22] 10Labs-project-Wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2545525 (10Dzahn) deleted all wikicafe.metacafe ``` MariaDB [wikistats]> delete from mediawikis where statsurl like "%wikicafe.metacafe%"; Query OK, 7 rows affected (0.03 sec) ```... [21:12:37] 10Labs-project-Wikistats, 10Internet-Archive: Remove some big former MediaWiki sites - https://phabricator.wikimedia.org/T136184#2545557 (10Dzahn) 05Open>03Resolved [21:22:16] 10Labs-project-Wikistats: Update lietuvai.lt statistics URLs - https://phabricator.wikimedia.org/T136183#2326049 (10Dzahn) ELIP is now rank 7 (some big but broken ones have been deleted). This issue is not limited to one wiki, i also see it for wikihow (rank 8) I think it happens when the statistics URL ends i... 
[21:22:32] 10Labs-project-Wikistats: Update lietuvai.lt statistics URLs - https://phabricator.wikimedia.org/T136183#2545579 (10Dzahn) a:03Dzahn
[21:24:53] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:38:03] 06Labs, 10Tool-Labs: New entries in meta_p.wiki are missing a URL - https://phabricator.wikimedia.org/T142759#2545702 (10MaxSem)
[21:45:31] 06Labs, 10Tool-Labs: New entries in meta_p.wiki are missing a URL - https://phabricator.wikimedia.org/T142759#2545702 (10AlexMonk-WMF) Also seems to have the wrong language. They both get lang=en
[21:49:03] 06Labs, 10Tool-Labs: New entries in meta_p.wiki are missing a URL - https://phabricator.wikimedia.org/T142759#2545771 (10yuvipanda) @AlexMonk-WMF does your rewrite of the script take care of these too? if so I'm tempted to just hand fix them.
[21:55:49] folks, I'm having an issue with an instance where most processes hang after being launched and refuse to respond to signals. syslog is filled with https://dpaste.de/GtaL/raw
[21:57:20] hi earwig
[21:57:26] heya yuvi
[21:57:31] chasemp ^ another instance hung on jbd2 I see
[21:57:34] earwig which instance is this?
[21:57:39] wpx-prod-01
[21:57:46] what a day
[21:58:20] earwig can you add me to the project?
[21:58:31] chasemp I suspect we can ssh in and will find a few processes in D state here
[21:58:50] give me a sec
[21:58:55] ok
[21:59:50] not sure if I have permission
[22:00:00] ah
[22:00:07] let me add myself ;)
[22:00:12] earwig which project is it? wpx?
[22:00:16] yes
[22:00:33] ok
[22:01:31] when I run a simple script, like replace.py, it interacts with me, but when I use jsub, it cannot interact with me. What should I do?
[22:02:12] nope, can't ssh in.
[22:02:37] hm, I can...
[22:02:48] nn1|2 gridengine doesn't do interactive things. if you just want to run interactive pwb.py scripts I recommend using https://www.mediawiki.org/wiki/Manual:Pywikibot/PAWS_walk-through
[22:03:01] earwig I see. interesting.
can you tail /var/log/auth.log
[22:04:11] https://dpaste.de/4j0K/raw
[22:05:58] hmm
[22:07:04] root 18297 0.0 0.0 98000 932 ? D 22:01 0:00 sshd: yuvipanda [priv]
[22:07:10] sshd hung too
[22:07:13] it's stuck for me
[22:07:14] there's yer problem
[22:07:14] aaah
[22:07:14] D state
[22:07:16] there we go
[22:09:36] 06Labs, 10Tool-Labs: New entries in meta_p.wiki are missing a URL - https://phabricator.wikimedia.org/T142759#2545808 (10AlexMonk-WMF) To be honest with you @yuvipanda at this stage I don't even know if that table has been generated with the old maintain-replicas.pl or my maintain-meta_p.py that I wrote last y...
[22:10:39] earwig can you output 'uname -r'?
[22:11:04] 3.19.0-2-amd64
[22:11:46] so that's jessie as well
[22:11:47] hmm
[22:11:55] kernel issue?
[22:12:08] yuvipanda: I'm deep into looking at this kvm proc, not that it may be useful, but fyi
[22:13:54] earwig not entirely sure, we're figuring it out. how much longer can this stay stuck?
[22:14:27] I don't think it's extremely urgent
[22:14:49] but I'd like to get stuff back up ... you know, before tomorrow
[22:15:33] ok, will investigate and reboot later then
[22:15:40] thanks for your help
[22:35:14] 10Labs-project-Wikistats: wikistats (labs project): convert (all) mediawikis to use API instead of parsing old Special:Statistics - https://phabricator.wikimedia.org/T142766#2545950 (10Dzahn) conversion mode, method 1 @wikistats-cowgirl:~# /usr/bin/php /usr/lib/wikistats/update.php mw convert 10 wikis succesfu...
[22:37:13] 10Labs-project-Wikistats: wikistats (labs project): convert (all) mediawikis to use API instead of parsing old Special:Statistics - https://phabricator.wikimedia.org/T142766#2545982 (10Dzahn)
[22:53:03] I see T142165 has a patch merged - any hope labs will be back to normal soon?
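The diagnosis above hinges on spotting processes in uninterruptible sleep ("D" state, the third field of `/proc/<pid>/stat`), which is the classic symptom of hung I/O such as a wedged jbd2 journal thread. A sketch of scanning for them; this is my own helper, not a tool used in the incident:

```python
import os

def d_state_processes(proc="/proc"):
    """Return (pid, comm) pairs for processes in uninterruptible sleep ('D').

    Parses /proc/<pid>/stat: the comm appears in parentheses and may itself
    contain spaces or parens, so the state is taken as the first field after
    the *last* closing paren. Requires a Linux-style procfs.
    """
    hung = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # skip non-PID entries like /proc/meminfo
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        state = stat.rsplit(")", 1)[1].split()[0]
        if state == "D":
            comm = stat[stat.index("(") + 1 : stat.rindex(")")]
            hung.append((int(entry), comm))
    return hung
```

On a healthy host this returns an empty (or briefly non-empty) list; on `wpx-prod-01` it would have listed the stuck `sshd` alongside the jbd2 thread.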
[22:53:03] T142165: Default source group (security group) allowances do not update properly - https://phabricator.wikimedia.org/T142165
[22:54:01] SMalyshev: I think instance creation was turned back on...
[22:54:14] bd808: what about security groups?
[22:54:36] I believe you can manage them via horizon now, but not the wiki
[22:54:46] * bd808 hasn't actually tried
[22:55:21] horizon does a better job with them than the wiki anyway. You can change groups there after instance creation
[22:55:34] which is really nice because I forget half the time
[22:57:45] ah ok I will try horizon
[22:59:58] bd808: you can change them afterwards?
[23:00:26] tom29739 yep
[23:00:30] tom29739: yeah you can add more security groups to existing instances via horizon
[23:00:35] Oh.
[23:00:39] That's really good.
[23:01:41] I didn't think that option worked, a doc somewhere said it couldn't be done
[23:01:55] i tried it yesterday
[23:02:06] * tom29739 goes to find and update that doc
[23:04:07] hmm... in security group, description field is not mandatory, but has minimum character limit of 1
[23:04:12] that's sneaky
[23:12:16] report it to horizon upstream? :)
[23:33:38] Amir1: looks like there was a recent change to the format of the ORES API, is this permanent? I recall it happening before and it got rolled back
[23:34:35] this now errors out as an undefined index: https://github.com/Niharika29/PlagiabotWeb/blob/master/src/Controllers/CopyPatrol.php#L85
[23:47:28] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
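The CopyPatrol breakage reported above (an "undefined index" error after the ORES API response format changed) is a common failure mode when code indexes straight into a JSON response. A hedged sketch of the defensive alternative, in Python rather than the PHP used by CopyPatrol; the key names here are illustrative and are not the actual ORES response schema:

```python
def get_score(response, rev_id, model="damaging"):
    """Pull a nested score out of an API response dict, tolerating missing keys.

    Direct indexing (response[rev_id][model]["score"]) raises as soon as any
    level is renamed or dropped; chained .get() calls degrade to None, which
    the caller can detect and log instead of crashing the page.
    """
    return (
        response.get(str(rev_id), {})
        .get(model, {})
        .get("score")
    )
```

Returning `None` on a schema mismatch lets the tool keep rendering results while logging that the upstream format changed, which is exactly the situation described in the log.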