[00:33:13] any labs folks around. Having a bit of nodepool trouble in CI.
[00:35:57] seems to be hung up on connecting to labnet1002.eqiad.wmnet
[00:42:50] andrewbogott: ^ last time this came up it was a rabbitmq thing
[00:43:16] instance creation seems to be down as well. AFAICT
[00:43:48] YuviPanda, ^
[00:43:55] uh
[00:44:17] looking thcipriani
[00:44:22] thanks
[00:44:25] thcipriani: where did you see labnet1002.eqiad.wmnet?
[00:44:37] in the nodepool logs
[00:44:52] I wonder why it needs to connect to that
[00:45:02] also in trying to manually delete an instance marked as delete when I do: nodepool list
[00:45:04] nodepool delete --now 103676
[00:45:22] gives me a bunch of INFO urllib3.connectionpool: Starting new HTTP connection (1): labnet1002.eqiad.wmnet
[00:45:26] that eventually time out
[00:45:33] thcipriani: try now?
[00:46:03] Maybe it needs to connect directly to the instances it's managing?
[00:46:14] I just restarted nova-conductor
[00:46:23] oh, it's explicitly going to labnet1002
[00:46:26] okay. . .
[00:46:30] hmm still seeing a lot of Exception: Timeout waiting for server 91cb7084-1022-447f-bd94-a90a4a7fb5f7 deletion in wmflabs-eqiad
[00:46:32] we've always solved this with either a nova conductor, nova scheduler or rabbitmq restart
[00:46:46] last time this happened it was a rabbitmq restart
[00:46:50] right
[00:47:08] still seems plugged up from what I can see :(
[00:47:09] nothing useful in any of the logs?
[00:47:14] oh wait
[00:47:29] starting to see instances being deleted in the logs
[00:47:35] INFO nodepool.NodePool: Deleted jenkins node id: 103684
[00:47:50] \o/
[00:47:56] so are things moving?
[00:48:52] eh, not just yet ... /me flails at it some more
[00:49:02] I can restart the next thing in line
[00:49:10] which is nova scheduler
[00:49:24] thcipriani: let me know if it doesn't move at all
[00:50:49] eh, doesn't seem to be moving after all :(
[00:51:38] ok
[00:52:15] it logs "Deleted nodepool instance xxxx" before it goes into its loop with labnet1002. I think it was just retrying.
[00:52:41] not sure what exactly is even running on labnet that it is connecting directly to it
[00:52:53] The last time this happened andrewbogott said that he'd just learned at the OpenStack conf that "it's almost always rabbit" that has locked up
[00:53:16] right
[00:53:19] so I'm going to restart it next
[00:53:25] thcipriani: shall I?
[00:54:09] YuviPanda: yup. no change in the instance list, all instances show "delete". Logs show that nodepool is trying to delete. No change.
[00:55:40] the restart takes a while
[00:55:42] fun
[00:56:26] thcipriani: restart finished
[00:56:41] still left to do is nova-network restart
[00:56:43] on labnet
[00:57:44] it's creating new instances!
[00:57:53] wooo!
[00:58:04] so bd808 and andrewbogott and thcipriani were right and I should've just restarted rabbitmq first time
[00:58:06] oh well
[00:58:26] :P
[00:58:34] still moving slowly...manually deleting some now
[00:58:39] ok!
[00:58:48] * YuviPanda isn't too much of a fan of nodepool, but that's neither here nor there
[00:59:08] I'm going AFK now, thcipriani! is that ok?
[00:59:40] YuviPanda: I *think* so. I'm able to delete instances, and they're getting rebuilt slowly.
[00:59:55] thank you for your help!
[01:00:07] np!
[01:03:41] do we know why rabbitmq keeps locking up?
[01:04:02] is probably being chased by a firefox...
[01:04:10] xD
[01:04:41] legoktm: I think it's on andrew's never ending list of things to investigate
[01:05:10] but I would go with the short answer of "because message queues break sometimes"
[01:05:16] * legoktm nods
[01:31:01] Labs, Tool-Labs, Operations, Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2277377 (BBlack) The last fix P3110 may leave it in a state where it can't renew, but honestly I'm not certain. We can...
[01:47:16] YuviPanda: I changed the file handle limit so it really shouldn't be locking up… but, good to know.
[03:28:28] Labs: glancesync cron spam - https://phabricator.wikimedia.org/T135463#2299780 (Andrew)
[05:10:23] Labs, Tool-Labs, Pywikibot-core: Tool Labs: shared Pywikibot code not available - https://phabricator.wikimedia.org/T125505#2299912 (Ato_01) Resolved>Open p:Triage>Normal
[05:12:20] Labs, Tool-Labs, Operations, Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2277377 (Dzahn) added @Nemo_bis because of the open gerrit change https://gerrit.wikimedia.org/r/#/c/227079/ . this mi...
[06:51:14] Labs, Tool-Labs, Operations, Patch-For-Review: setup Letsencrypt for toolserver.org (toolserver.org certificate to expire 2016-06-30) - https://phabricator.wikimedia.org/T134798#2300014 (Nemo_bis) >>! In T134798#2299914, @Dzahn wrote: > we should add one more cert for wiki.toolserver.org besides...
[08:11:35] Tool-Labs-tools-Global-user-contributions, Easy, Regression: GUC tool always outputs "searched 1 projects" instead of actual count - https://phabricator.wikimedia.org/T118662#2300205 (Danny_B) Thanks. I thought the counter increases via JS as one would expect it...
[12:10:16] (PS1) Youni Verciti: Rev 0.4 Join documentation's namespaces & add lib [labs/tools/fr-wikiversity-ns] - https://gerrit.wikimedia.org/r/289182
[13:12:53] Labs, Tool-Labs, Pywikibot-core: Tool Labs: shared Pywikibot code not available - https://phabricator.wikimedia.org/T125505#2300984 (Ato_01) Open>Resolved p:Normal>Triage
[14:36:13] PROBLEM - SSH on tools-webgrid-lighttpd-1408 is CRITICAL: Server answer
[15:41:31] Labs, Labs-Infrastructure, Operations, Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#2301413 (Andrew)
[15:41:34] Labs, Labs-Infrastructure, Operations, Patch-For-Review: Get labs-ns0, labs-recursor0, and labservices1001 on the same system, and labs-ns1, labs-recursor1, and holmium on another - https://phabricator.wikimedia.org/T135447#2301412 (Andrew) Open>Resolved
[15:43:31] andrewbogott: around?
[15:43:48] tgr: Yes but only for another minute or two. What's up?
[15:44:04] could you help me out with https://phabricator.wikimedia.org/T131630 some time today?
[15:45:28] tgr: yes, I'll have a look later in the day. That accent worries me a bit, I'll have to dig in the code and see if that's supported.
[15:45:31] Will you be around in an hour or two?
[15:45:41] yes
[15:45:44] thanks!
[15:54:59] andrewbogott, do you know if there has been an upgrade or any change very recently regarding the wiki at silver?
[15:55:31] maybe Krenair knows something?
[16:00:58] jynus: MediaWiki on silver would be updated by the deploy train every Tuesday
[16:01:18] ah, it was integrated, then
[16:01:44] yeah. It's a "group 1" host -- https://tools.wmflabs.org/versions/
[16:02:04] jynus: I saw the search errors report in -ops elastic has had some issues today, same thing was happening on other wikis (for search) and I believe wikitech is using the prod es cluster
[16:02:14] but as far as weird db stuff, no clue
[16:03:04] ah, sorry, I thought it was wikitech-specific
[16:03:04] "An error has occurred while searching: Search is currently too busy. Please try again later." parav.void reported for officewiki (which I guess uses es?) earlier
[16:03:21] I saw other things, though
[16:03:40] "slave error" on a slaveless service
[16:03:50] I will check it later
[16:04:41] gehel: ^ ppl still reporting search issues as of last few minutes fyi
[16:06:07] chasemp: Thanks! Too busy reading thread dump to watch out...
[16:07:11] damn, puppet restarted elasticsearch behind my back...
[16:10:33] Labs, wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2301502 (Niharika)
[16:10:36] it seems we had a peak in response time for the last ~20 minutes. It seems to settle again. This time, elastic1001 and elastic1003 show high load.
[16:16:27] we still have a few relocating shards (24), but going down. Probably the rebalance after 1001 coming back online. Segment merges in progress on both 1001 and 1003. I see a small peak of rejections around 1600 UTC, but it seems stable after that. I'll wait a few more minutes and keep an eye on the graphs. If things don't settle, I'll move all traffic to
[16:16:27] codfw.
[16:35:54] Can you change the webroot for a tool to be a different folder on tool labs?
[16:36:00] For lighttpd.
[16:38:55] tom29739: I've just used symlinks when I need that rather than overwriting the lighttpd config
[16:40:03] That has the advantage of being visible in a directory listing rather than needing to find the config to figure out what it has been set to
[16:41:04] as an example, the admin tool symlinks ~/public_html to ~/toollabs/www
[16:41:46] Can it be done with the webserver running, or do I have to restart it?
[16:42:12] that I'm not sure about. It seems like it should work without a restart
[17:09:48] bd808: is there a "how to deploy code to the beta cluster" page?
[17:10:05] "merge it in gerrit"
[17:10:15] I need to bisect https://phabricator.wikimedia.org/T135525
[17:11:09] live hacking beta cluster is a bit difficult because of the jenkins jobs
[17:11:37] if it works for a few minutes that's good enough for me
[17:11:49] * bd808 looks to see if we have any instructions
[17:15:59] tgr: lol. the instructions at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/HetDeploy are like 4 years out of date
[17:17:54] but at least appropriately warned against!
[17:22:34] tgr: I'm going to try to update the page. The tl;dr should be 'ssh to deployment-tin; sudo -u jenkins-deploy -s; do things like you would on tin'
[17:23:07] but also note that you will be fighting with jenkins for config and code changes
[17:31:28] tgr: Can you try a horizon login while I look at the logs?
[17:31:49] (this will most likely require several tries)
[17:32:00] andrewbogott: at a meeting; can we do it in 30 mins?
[17:32:10] sure, just ping me when you're available
[18:01:57] (more like 60; sorry, working on an UBN bug)
[18:05:22] (PS1) Youni Verciti: Rev 0.5 Pedagigical step - projet.py [labs/tools/fr-wikiversity-ns] - https://gerrit.wikimedia.org/r/289252
[18:18:20] Labs, Patch-For-Review: glancesync cron spam - https://phabricator.wikimedia.org/T135463#2302045 (Andrew) Open>Resolved The above patch seems to have fixed this.
[18:24:40] MediaWiki-extensions-OpenStackManager, MediaWiki-Authentication-and-authorization, Reading-Infrastructure-Team: Update OpenStackManager to use AuthManager - https://phabricator.wikimedia.org/T110288#2302098 (Anomie) Open>Resolved a:Anomie Ok, let's declare victory here.
[18:26:18] so... if I have a URL like http://devhub.wmflabs.org/ which is a 502, how could I find out somewhere on wikitech if that is "intentional" (outdated link) or not?
[18:27:14] (...or the project or instance name corresponding to that URL. Or whatever else makes any sense.)
[18:32:47] andre__afk: I would start with https://tools.wmflabs.org/
[18:32:55] but there's not a perfect system for this
[18:33:20] Hmm, I didn't think of tools as it's not under tools.wmflabs.org
[18:33:22] I see
[18:33:34] oh, good point
[18:33:35] hm
[18:34:20] As S isn't working on that anymore I'm basically wondering if that URL is intentionally dead or not. And if it is, I'd like to remove some links on wiki pages.
[18:37:44] (CR) Jean-Frédéric: [C: 2] Standardise php whitespace to tab [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287290 (owner: Lokal Profil)
[18:38:09] (CR) jenkins-bot: [V: -1] Standardise php whitespace to tab [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287290 (owner: Lokal Profil)
[18:39:16] (PS3) Jean-Frédéric: Change i18n message to allow non-Wikipedia projects [labs/tools/heritage] - https://gerrit.wikimedia.org/r/286796 (owner: Lokal Profil)
[18:39:46] andre__afk: it's set up as a web proxy; I don't know offhand how to look up what it's proxying
[18:54:16] tgr: still debugging?
[19:20:42] andrewbogott: sorry, just finished
[19:21:31] tgr: want to try a login now?
[19:22:11] yes
[19:23:00] done
[19:23:04] andrewbogott: ^
[19:26:04] Labs, Horizon: Unable to login on Horizon - https://phabricator.wikimedia.org/T131630#2173147 (Andrew) (keystone.identity.core): 2016-05-17 19:22:54,235 DEBUG Local ID: tgr (keystone.common.wsgi): 2016-05-17 19:22:54,240 ERROR 1267 (HY000): Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_genera...
[19:26:17] tgr: ok, it's some kind of encoding screwup. I'll see what I can figure out
[19:26:40] thanks!
[19:27:16] and sorry for the trouble... in my defense, it was Labs policy at the time :)
[19:28:09] bd808: i encountered this: https://phabricator.wikimedia.org/P3121
[19:29:17] andre__afk: That proxy is pointed at the http://developer-doc-devhub.eqiad.wmflabs backend instance in the developer-doc project
[19:29:39] thedj: fun.
[19:29:46] should be easy to fix
[19:31:09] bd808: can you explain what it means, then maybe i can fix myself :)
[19:32:09] how does the require redefine it ? that makes little sense to me. or is it a scoping problem or something ?
[19:32:23] Puppet is declarative and doesn't allow multiple definitions of resources with the same name. The two files named in the trace are trying to declare the same resource
[19:32:42] we have a "require_package" function to handle this sort of thing
[19:33:01] so one or both manifests is not using that helper function
[19:33:04] that's already used here...
[19:33:05] require_package('nodejs')
[19:33:31] in both files?
[19:33:40] the other is the actual declaration
[19:33:53] package { 'nodejs': ensure => latest,
[19:34:13] ugh. done for the ensure => latest
[19:34:27] ok, so not so easy of a fix maybe
[19:35:24] hmm, shall i file a ticket in that case ?
[19:36:00] sure
[19:36:47] The graphoid module should be converted to look like all the other node services, but that may be more work than anyone wants to do
[19:40:20] bd808: done: https://phabricator.wikimedia.org/T135549
[19:41:33] thanks. a workaround would be to use `require ::npm` instead of installing the nodejs package in ::graphoid
[19:41:44] that's probably the quick fix version
[19:43:23] (CR) Lokal Profil: [C: 2] "I take the rebase as a +2." [labs/tools/heritage] - https://gerrit.wikimedia.org/r/286796 (owner: Lokal Profil)
[19:44:17] (Merged) jenkins-bot: Change i18n message to allow non-Wikipedia projects [labs/tools/heritage] - https://gerrit.wikimedia.org/r/286796 (owner: Lokal Profil)
[19:50:11] (PS2) Lokal Profil: Standardise php whitespace to tab [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287290
[19:55:01] (PS3) Lokal Profil: Standardise php whitespace to tab [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287290
[19:56:42] (CR) Lokal Profil: "Rebased and also converted tests/ so that future ci won't complain" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287290 (owner: Lokal Profil)
[20:04:33] (PS4) Lokal Profil: Standardise php whitespace to tab [labs/tools/heritage] - https://gerrit.wikimedia.org/r/287290
[20:04:40] tgr: so, I have http://labtestwikitech.wikimedia.org/ and https://labtesthorizon.wikimedia.org
[20:04:54] can you try to create an account on labtestwikitech and see if you can reproduce the issue?
[20:05:07] You can use the same username/shell name but use a different password of course
[20:05:51] tgr: not urgent but it would give us a safe place to test potential fixes
[20:11:18] andrewbogott: login worked fine there
[20:11:32] on labtesthorizon too?
[20:11:38] yes
[20:11:47] dammit :) ok, thanks
[20:53:45] andrewbogott: hi there
[20:54:00] need a cpu quota update limit please
[20:56:10] oh, my ram exceeded too
[20:56:41] matanya: sheesh. stop being such a resource hog ;)
[20:56:54] :)
[20:57:50] bd808: mining bitcoin is heavy :P
[20:58:18] that's why I do it with common.js on enwiki
[20:58:48] heh, lolz
[21:09:29] matanya: can you open a ticket? I'm hoping to get some more hardware soon but it's unclear when/how much
[21:09:48] andrewbogott: sure, any kind of eta ?
[21:10:11] I should at least know what's happening by the end of the month
[21:10:22] thanks
[21:10:51] andrewbogott: so for the time being no quota limits changes ?
[21:11:15] I need to do another audit, but things have been pretty tight
[21:11:36] i see
[21:13:08] Labs, Labs-Infrastructure, Continuous-Integration-Scaling, Patch-For-Review: Bump quota of Nodepool instances (contintcloud tenant) - https://phabricator.wikimedia.org/T133911#2303104 (Andrew) a:Andrew
[21:13:25] * andrewbogott frowns at ganglia for a while
[21:14:38] Labs: raise quota limit for project video - https://phabricator.wikimedia.org/T135560#2303109 (Matanya)
[21:16:07] andrewbogott: ^
[21:16:47] thanks
[21:21:04] we can make shinken a bit smaller, IIRC there is a ticket about that?
[21:21:08] (PS1) Lokal Profil: Add lang and project to statistic reports [labs/tools/heritage] - https://gerrit.wikimedia.org/r/289314 (https://phabricator.wikimedia.org/T135502)
[22:20:15] (PS1) Youni Verciti: Rev 0.6 common part for ns_doc Ok [labs/tools/fr-wikiversity-ns] - https://gerrit.wikimedia.org/r/289321
[22:21:46] Labs, wikitech.wikimedia.org: Reset OAuth authentication for Wikitech account for Niharika29 - https://phabricator.wikimedia.org/T135518#2301502 (Krenair) The process for dealing with these problems is here: https://wikitech.wikimedia.org/wiki/Password_reset#Reset_two_factor_authentication Basically we n...
[22:32:45] PROBLEM - Free space - all mounts on tools-worker-1004 is CRITICAL: CRITICAL: tools.tools-worker-1004.diskspace.root.byte_percentfree (<70.00%)