[00:21:24] Coren: Any chance of you being able to restart the cluebot_redis_relay job from cluebotng?
[00:23:03] hey a930913
[00:23:05] I can give it a shot
[00:23:37] YuviPanda: \o/
[00:24:03] a930913: hmm, it isn't running - how was it started?
[00:24:09] is it in the 'cluebotng' tool?
[00:24:15] I have no idea how to start it
[00:24:31] YuviPanda: Can you check the .bash_history?
[00:24:48] I did
[00:24:51] nothing there
[00:24:59] find . -name '*redis*'
[00:25:03] didn't find anything useful either
[00:25:15] Anything relayish?
[00:25:37] nothing in bash history
[00:25:40] checking with find now
[00:25:45] nope
[00:25:49] nothing for *relay* either
[00:28:09] Is there anything in bash history?
[00:29:09] a930913: some django stuff
[00:29:12] but no job submissions
[00:29:16] (this is the cluebotng tool)
[00:29:33] Oh, try cluebot?
[00:29:40] Less the NG.
[00:29:45] haha yeah just did
[00:29:48] found the relay
[00:29:52] want me to restart it?
[00:30:24] It'll stop my daily email telling me defconbot is down :D
[00:30:47] !log tools.cluebot restart cbnj_relay job per a930913
[00:31:00] Also it should fix defconbot. But priorities :p
[00:31:07] heh
[00:31:23] a930913: well, restarted
[00:31:35] Is that relay or is there a specific redis one?
[00:32:42] it's the only one with relay in its name
[00:33:02] There is no redis one?
[00:35:47] not that I can see
[00:36:20] * a930913 hmms.
[00:39:07] qacct is slow :o
[00:39:42] yeah it linearly goes through every job ever
[00:39:44] is quite terrible
[00:41:22] YuviPanda: It doesn't seem to have worked :/
[00:41:28] :(
[00:41:43] I don't really know anything about cluebot so not sure what I can do :(
[00:43:28] Could I get access so I can have a poke around, seeing as the maintainers are no longer around?
[00:44:00] (If I get time D: )
[00:55:31] YuviPanda: Thanks btw <3
[00:56:11] a930913: I don't really know unfortunately - if I can get a 'yes' from any of the current maintainers then it's easy to just add you...
[00:56:19] a930913: even having that be a onwiki yes is ok
[00:57:23] The point is, they are MIA.
[00:57:54] right, and we don't really have a way for people to take over MIA tools
[00:58:25] nor to figure out who decides who gets to take over
[00:58:45] I thought that was the point of having tools as opposed to personal accounts?
[00:58:59] yes, the point is that you give people access so when you disappear they can continue it
[00:59:09] except that the current maintainers didn't give enough people access...
[00:59:20] if it was a personal account they can't give anyone access...
[00:59:46] is cluebot enwiki only? Does enwiki have a process that can be used for this particular case?
[00:59:52] legoktm: ^
[01:00:14] Well you better hope legoktm is still around if I disappear then :p
[01:00:18] a930913: also are you sure they're all MIA? I know that I poked someone named Richard (who is on the maintainers list) when it last went down
[01:00:19] Yes and no
[01:00:40] so maybe giving them all a poke would get at least one person to respond with 'yes, sure, take over'
[01:00:42] Damianz was the last to maintain it.
[01:00:52] Cobi responded the last time I emailed him
[01:01:04] That was about a year ago?
[01:01:34] Mmm, I suppose I could try Cobi or Crispy.
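(Editor's note: the relay hunt and restart above boil down to a few standard Tool Labs shell steps. Below is only a sketch of that workflow, assuming the usual `become`/`qstat`/`jstart` grid-engine wrappers; the relay's actual start command never appears in the log, so the script path is a placeholder.)

```bash
# Sketch of the job hunt and restart discussed above (Tool Labs grid engine).
# The relay's real start command is unknown; ~/relay/run.sh is a placeholder.
become cluebot                                 # switch from your own account to the tool account
grep -i relay ~/.bash_history || true          # look for an earlier submission command
find ~ -iname '*relay*' -not -path '*/.*'      # search the tool's home for relay scripts
qstat                                          # list the tool's currently running grid jobs
jstart -N cbnj_relay ~/relay/run.sh            # resubmit the relay as a continuous job
```

(As noted in the log, `qacct` is the slow path here because it scans the whole accounting file; `qstat` only shows currently registered jobs.)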
[01:01:51] Rich Smith responded about a few months ago
[01:01:58] @seen DamianZaremba
[01:01:59] legoktm: I have never seen DamianZaremba
[01:02:05] @seen Damianz
[01:02:05] legoktm: Last time I saw Damianz they were quitting the network with reason: Ping timeout: 250 seconds N/A at 9/19/2015 4:15:23 AM (50d20h46m42s ago)
[01:02:07] Though they actively stopped a while back as Rich then Damianz took over.
[01:02:09] according to http://tools.wmflabs.org/ only Rich Smith and Damian are cluebot maintainers
[01:02:23] yes they don't actually have access
[01:02:26] only Rich and Damian do
[01:02:31] Cobi has access to the on-wiki account
[01:06:09] a930913: so leave Rich a message on wiki?
[01:06:38] * YuviPanda has to go afk in a few minutes
[01:07:29] * a930913 will try go on a hunt.
[01:08:08] Night guys. Thanks again.
[01:08:55] yw
[06:09:00] !ping
[06:09:00] !pong
[06:09:02] ok
[06:12:51] 10Quarry: Time limit on quarry queries - https://phabricator.wikimedia.org/T111779#1792561 (10yuvipanda) Yes, that seems to have been a bug that I hopefully have fixed :) Usually when they get killed they get their status set to 'killed' but apparently not that one. Now queries get killed in 30min.
[06:21:18] 10Quarry, 5Patch-For-Review: Add chat or forum (Quarry) - https://phabricator.wikimedia.org/T117647#1792580 (10yuvipanda) There's a flow board linked from the top bar now! \o/ I wonder how it'll be used.
[09:54:41] 10Tool-Labs-tools-Other: Crash of svgtranslate - https://phabricator.wikimedia.org/T118146#1792722 (10Aklapper) a:5Jarry1250>3None
[10:09:55] I've found a glitch - http://quarry.wmflabs.org/query/6052
[10:10:08] Is supposed to be dropping stuff templated with DYKfile
[10:10:12] It isn't
[10:10:30] Mostly because it seems to have an issue with a local page for a file at Commons :(
[10:10:36] (sigh)
[12:29:39] I am trying to "become" my tool, but it is not working. Is this happening to anyone else?
[12:37:47] works for me
[12:37:57] what tool are you trying to become?
[12:38:25] and you are uid=studiesworld right?
[12:43:23] hi sDrewth
[12:43:32] Can i have a word about a glitch?
[12:44:10] https://quarry.wmflabs.org/query/6046 is continuing to show entries that it SHOULD be filtering out given the query
[12:44:17] so something's wrong
[12:44:39] Either the relevant tables aren't being purged fully, or my query is overlooking something
[12:44:47] Yes, I am trying to become tools.studiesworld
[12:45:36] But, I was able to solve it.
[13:10:24] 6Labs, 10Labs-Infrastructure, 7Swift: Provide Swift object store(s) for the labs projects - https://phabricator.wikimedia.org/T114998#1793225 (10hashar) p:5Triage>3Normal
[13:21:40] 6Labs, 10Labs-Infrastructure, 7Swift: Provide Swift object store(s) for the labs projects - https://phabricator.wikimedia.org/T114998#1793287 (10hashar)
[14:54:59] odd thing happened, it seems no one in the search project can log into estest100[12].search.eqiad.wmflabs. We can log into the 2 other instances in this project though
[14:57:17] 6Labs, 10Labs-Infrastructure, 7Swift: Provide Swift object store(s) for the labs projects - https://phabricator.wikimedia.org/T114998#1793464 (10yuvipanda)
[14:58:13] ebernhar1son: I can login
[14:58:34] YuviPanda: how are you awake? :)
[14:58:35] ebernhar1son: I can watch the logs if you wanna login
[14:58:40] sure i'll try now
[14:58:56] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[14:59:02] YuviPanda: failure on estest1001 just now
[14:59:56] ebernhardson: so puppet's failing in a strange state where it's broken up on some commit where ldap is slightly broken
[15:00:03] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item elasticsearch::cluster_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/elasticsearch.pp:34 on node estest1001.search.eqiad.wmflabs
[15:00:05] Warning: Not using cache on failed catalog
[15:00:06] oh, that's probably my fault
[15:00:07] Error: Could not retrieve catalog; skipping run
[15:00:13] estest1001 is self hosted puppet master i haven't updated
[15:00:17] hah
[15:00:18] :)
[15:00:18] could you try just rebase?
[15:00:21] it auto updates though
[15:00:26] hmm
[15:00:36] it is at latest commit
[15:00:53] there's like a ticket somewhere to make it easier for project admins to ssh in as root
[15:00:56] i wonder why it didn't also break 1003, but no matter
[15:00:58] I should fix that
[15:01:03] yeah me neither
[15:01:10] ok so I'm going to add your key to root keys ebernhardson
[15:01:14] thanks
[15:01:15] and then you can ssh in and fix it :)
[15:01:20] yea that works :)
[15:01:37] kkk
[15:01:52] ebernhardson: I'm up because stupid kubecon.io ppl put their keynote at 8:45 AM
[15:02:16] YuviPanda: let me know how it is :)
[15:02:46] chasemp: will do!
[15:02:54] ebernhardson: you can login as root on 1001 now
[15:03:13] is this a new host or a pre-existing one?
[15:03:23] curious if a broken puppet hosed up an existing vm
[15:03:34] chasemp: pre-existing ones, although I did migrate these a few days ago
[15:03:43] YuviPanda: you're a speaker too eh, should be fun :)
[15:03:59] also root login worked, thanks
[15:04:03] 6Labs: Move estest100{1..3} instances to labvirt05, 10 or 11 - https://phabricator.wikimedia.org/T117927#1793495 (10yuvipanda) 5Open>3Resolved I've moved these to labvirt1005 last week
[15:04:10] yup
[15:04:38] maybe later we can kick around the why of login failure here then
[15:04:41] next week even
[15:05:23] chasemp: so looks like puppet failed at some point where it didn't write out /etc/ldap.yaml
[15:05:28] and then ssh doesn't work without that
[15:05:43] I guess the better question is, did ldap login there ever work?
[15:05:45] ahh, it looks like the problem is the puppet repo is in the middle of a rebase. fun :)
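(Editor's note: a rough sketch of the cleanup ebernhardson ends up doing here for the self-hosted puppetmaster stuck mid-rebase. The checkout path and branch are assumptions based on the usual labs setup, not stated in the log.)

```bash
# Recovering a self-hosted puppetmaster whose auto-rebase died halfway through.
# /var/lib/git/operations/puppet and the 'production' branch are assumed paths.
cd /var/lib/git/operations/puppet
git status                            # should report "rebase in progress"
git rebase --abort                    # throw away the half-finished rebase
git pull --rebase origin production   # retry the update by hand
puppet agent -tv                      # re-run puppet so files like /etc/ldap.yaml get written again
```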
[15:05:51] that's a better question than is this pre-existing vm
[15:07:17] hahah
[15:07:19] nice
[15:07:25] so the autorebaser failed and didn't clean up
[15:07:35] ebernhardson: would be great if you can file a bug with details and / or fix it :D
[15:07:47] YuviPanda: i'll do both
[15:07:52] awesome
[15:07:55] now I gotta runnnn
[15:08:01] well, i'll fix this machine and maybe fix the autorebaser, depends how it works but i'll try at least :)
[15:08:04] \o
[15:08:54] heh
[15:10:41] chasemp andrewbogott Coren btw, I did some graphite yesterday
[15:10:43] http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1447081798.367&target=project-proxy.novaproxy-01.reqstats.line_rate&target=tools.tools-proxy-02.reqstats.line_rate
[15:10:58] that's the req/s rates of both the tools and general labs proxy
[15:11:34] good deal man, we should make a grafana dashboard
[15:11:39] chasemp: yea +1
[15:11:47] chasemp: dunno if there's one that reads from labs graphite
[15:11:56] chasemp: I should also add more metrics
[15:12:07] chasemp: 'current ssh sessions' seems like next useful thing to have
[15:12:18] on bastions?
[15:12:20] yeah
[15:12:25] and current mosh sessions too
[15:12:26] agreed
[15:12:32] I dunno if there's a pre-existing collector
[15:12:35] wouldn't be too hard writing one
[15:12:41] probably not but yeah simple
[15:12:45] chasemp: another thing would be to get per-tool counts
[15:12:49] which would be great
[15:12:55] will require writing a simple logster plugin
[15:13:21] chasemp: I'm blocked in k8s land on getting https://github.com/kubernetes/kubernetes/pull/16250 sorted so might pick up these other things to do
[15:13:37] chasemp: also +1 to trying out the new thing you suggested this week even though I won't be there
[15:13:54] go conference dude :)
[15:14:09] it's like 1h30min away
[15:14:16] * YuviPanda goes to make tea
[15:29:08] andrewbogott: YuviPanda about? toolserver.org is down trying to sift through this and it seems there is a 'toolserver-legacy' project
[15:29:27] but there is a toolserver_legacy module which is not in any role which applies a role login banner
[15:29:46] yes, that’s all true… did the cert expire?
[15:30:29] not sure yet, where does this live?...well it's responding now maybe
[15:31:09] so cert expire can't be as it came back
[15:31:15] and it was unresponsive entirely
[15:37:16] I believe there’s one instance in toolserver_legacy that hosts the banner and/or redirect. YuviPanda that’s your thing isn’t it?
[15:38:56] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0]
[15:42:47] 6Labs, 10Labs-Infrastructure, 10hardware-requests, 6operations, and 2 others: Labs test cluster in codfw - https://phabricator.wikimedia.org/T114435#1793660 (10chasemp)
[15:42:49] 6Labs, 10Labs-Infrastructure, 10netops, 6operations, and 3 others: Allocate subnet for labs test cluster instances - https://phabricator.wikimedia.org/T115492#1793659 (10chasemp) 5Open>3Resolved
[15:48:27] andrewbogott: so failures correlate pretty close to ^ for tools-master.tools.eqiad.wmflabs
[15:48:35] wonder if there is a common thread
[15:48:53] the puppet failure?
[15:49:11] Seems… unlikely but possible. Was the puppet failure due to a merged patch, or just out of nowhere?
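(Editor's note: the "current ssh sessions" metric floated above could be collected with something as small as the cron-able sketch below, pushed over graphite's plaintext protocol. The graphite host and metric prefix are guesses for illustration, not the real labs values.)

```bash
# Toy collector: push the number of logged-in sessions to graphite's plaintext port.
# GRAPHITE_HOST and the metric prefix are assumptions, not confirmed values.
GRAPHITE_HOST="labmon1001.eqiad.wmnet"
GRAPHITE_PORT=2003
METRIC="tools.$(hostname -s).ssh.sessions"
COUNT="$(who | wc -l)"                          # crude count of interactive login sessions
printf '%s %s %s\n' "$METRIC" "$COUNT" "$(date +%s)" \
    | nc -w 1 "$GRAPHITE_HOST" "$GRAPHITE_PORT"
```

(The per-tool request counts mentioned next would need a logster plugin parsing the proxy logs rather than a one-liner like this.)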
[15:49:37] yeah weak question but out of nowhere afaik
[15:56:12] andrewbogott: not sure, and now that i'm logging in to look it seems i again don't have root :S
[15:56:29] andrewbogott: the rebase that failed was because it didn't cleanly rebase, not sure why it didn't rebase --abort afterwards
[15:56:49] but after aborting, puppet agent -tv still failed and i didn't get a chance to look
[15:57:21] oh man, I just re-remembered that tools has a local puppetmaster. I hate that
[15:57:45] (by not have root, i mean yuvi added my ssh key to root on estest1001.search.eqiad.wmflabs, and it's no longer letting me login)
[15:58:36] andrewbogott: (for posterity) tracked it down to relic.toolserver-legacy.eqiad.wmflabs
[16:01:59] ebernhardson: I’m super confused about what we’re talking about. Does estest1001.search.eqiad.wmflabs somehow relate to our discussion about toolserver-legacy?
[16:02:39] When you said ‘not sure’ above… what were you not sure about?
[16:03:03] andrewbogott: ahh, i didn't realise the self hosted puppet master convo switched, just before that we were talking about self hosted estest1001.search.eqiad.wmflabs and its ldap failure
[16:03:27] basically no one can log into it except root
[16:03:30] oh
[16:03:32] um...
[16:03:36] I was not part of that conversation
[16:03:41] no worries :)
[16:03:45] but can look if yuvi vanished
[16:03:57] yea he did, but enough would just be adding my ssh key to root and i can figure it out
[16:04:06] yuvi did, but then i logged out and somehow that key disappeared
[16:05:41] andrewbogott: no sorry for the confusion there, the two aren't related
[16:07:26] Coren: did you write a patch to page on virt node disk space?
[16:08:19] he did I recall +1'ing
[16:08:40] can I get a link for my outage report?
[16:08:55] andrewbogott: https://gerrit.wikimedia.org/r/#/c/251297/
[16:09:07] thanks!
[16:09:38] andrewbogott: https://gerrit.wikimedia.org/r/#/c/251292/ goes w/ it fyi
[16:17:04] * Coren will bbiab, has an appointment.
[16:52:21] andrewbogott: chasemp no that's Coren's
[16:52:24] * YuviPanda just got to conference
[16:53:07] andrewbogott: no tools doesn't have a local puppetmaster - just the k8s stuff inside tools has a local puppetmaster
[16:53:17] andrewbogott: chasemp and tools-master should be unrelated to toolserver
[16:54:12] YuviPanda: yeah, I was briefly confused
[16:54:35] yeah it's a bit confusing. so the only 'regular' toollabs things that are on the self hosted puppetmaster are the proxies
[16:54:52] and even those are set to auto pull every minute
[17:04:57] who hands out the Tech News 2015-46? https://meta.wikimedia.org/wiki/Tech/News/2015/46
[17:46:22] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review, 3labs-sprint-117: Give 'novaobserver' keystone account rights to read everything, everywhere, write or change nothing - https://phabricator.wikimedia.org/T104588#1794104 (10Andrew) Note that I plan to make the creds for this user 100% public.
[18:05:43] 6Labs, 10Labs-Infrastructure, 3labs-sprint-117, 3labs-sprint-118, 3labs-sprint-119: Move project membership/assignment from ldap to keystone mysql - https://phabricator.wikimedia.org/T115029#1794195 (10Andrew)
[18:34:20] YuviPanda: hello! it looks like the cronjob on https://android-builds.wmflabs.org/ isn't running. app builds are being properly published from https://integration.wikimedia.org/ci/job/apps-android-wikipedia-publish/ but aren't getting picked up on schedule. would you mind restarting or checking on this machine?
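(Editor's note: before reaching for a reboot, the stalled android-builds cron job niedzielski goes on to investigate could be triaged with a few generic checks like these; the actual crontab entry isn't shown anywhere in the log, so nothing below names it.)

```bash
# Generic first-pass checks for a cron job that stopped producing output (Ubuntu/Debian paths assumed).
crontab -l                                  # is the entry still installed for this user?
grep CRON /var/log/syslog | tail -n 20      # did cron fire anything recently, and as which user?
service cron status                         # is the cron daemon itself running?
```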
[18:34:49] niedzielski: probably can't get to it till thursday :( at a conference
[18:35:01] do you have access to take a look at it?
[18:36:28] YuviPanda: aw nuts. hm, i don't think so. is it at android-builds.eqiad.wmflabs?
[18:37:07] niedzielski: yeah think so. you should have access, I can probably give you access
[18:37:23] YuviPanda: that'd be great!
[18:38:51] andrewbogott: chasemp Coren can one of you add niedzielski to the mobile project on labs?
[18:38:59] niedzielski: it's 'android-builder'
[18:45:24] YuviPanda: ok thanks. i'll check it out when i'm added!
[18:54:13] anyone with root on created labs instances able to add my key to root@estest1001.search.eqiad.wmflabs? It has had a failure of the self hosted puppet master and it's not possible to log in
[19:04:23] ebernhardson: kk let me do so again
[19:04:28] niedzielski: what's your wikitech username?
[19:04:40] YuviPanda: thanks, puppet overwrote the change i imagine
[19:04:59] ebernhardson: so I added your key manually to ebernhardson
[19:05:00] YuviPanda: i think it's "niedzielski". if that's not there, then "sniedzielski"
[19:05:03] so you should be able to ssh in now
[19:05:13] denied
[19:05:53] ebernhardson: try again?
[19:06:07] YuviPanda: works! thanks
[19:06:14] ebernhardson: yw
[19:06:22] niedzielski: added niedzielski, try 'android-builder.eqiad.wmflabs'
[19:06:46] YuviPanda: woo! works great!
[19:06:53] YuviPanda: thanks!
[19:07:16] yw
[19:07:22] YuviPanda: is it ok to reboot this guy if i need to?
[19:07:58] niedzielski: yup
[19:08:04] you should have root
[19:08:40] YuviPanda: got it, thanks!
[19:33:09] YuviPanda: you all sorted? was grabbing food but I'm about now
[19:33:30] chasemp: yup
[19:49:17] YuviPanda: I'm around
[20:15:04] (03PS1) 10Niedzielski: Fix crontab schedule and directory structure [labs/tools/wikipedia-android-builds] - 10https://gerrit.wikimedia.org/r/252027
[20:26:27] 10Quarry: Time limit on quarry queries - https://phabricator.wikimedia.org/T111779#1794532 (10Jarekt) yuvipanda, it seems like there is still a problem, as http://quarry.wmflabs.org/query/2556 is running for 2 hours now. By the way, is there a way to get this query to go faster?
[20:32:36] (03CR) 10BearND: [C: 031] Fix crontab schedule and directory structure [labs/tools/wikipedia-android-builds] - 10https://gerrit.wikimedia.org/r/252027 (owner: 10Niedzielski)
[20:33:18] (03CR) 10Yuvipanda: [C: 032 V: 032] Fix crontab schedule and directory structure [labs/tools/wikipedia-android-builds] - 10https://gerrit.wikimedia.org/r/252027 (owner: 10Niedzielski)