[00:00:46] 06Labs, 10Tool-Labs: Data collection for tools job_count seems to be broken - https://phabricator.wikimedia.org/T149634#2758878 (10bd808) 05Open>03Resolved a:03chasemp {F4682058}
[00:05:26] legoktm ^^ it actually does try and re connect ssh.
[00:08:12] 06Labs, 10MediaWiki-extensions-Newsletter: Clear 'nl_*' tables in http://newsletter-test.wmflabs.org/ - https://phabricator.wikimedia.org/T149651#2758891 (1001tonythomas)
[00:09:18] paladox: On which line?
[00:10:32] (03PS4) 10Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[00:11:06] marktraceur hi, line 166
[00:11:15] It calls this ircClient.addListener('join', waitForChannelJoins);
[00:11:54] which calls this https://phabricator.wikimedia.org/diffusion/TGRT/browse/master/src/relay.js;8ff7cf21a1f72783afad72f6fbbd6b3eb2e9402c$140
[00:12:08] then https://phabricator.wikimedia.org/diffusion/TGRT/browse/master/src/relay.js;8ff7cf21a1f72783afad72f6fbbd6b3eb2e9402c$149
[00:12:16] then https://phabricator.wikimedia.org/diffusion/TGRT/browse/master/src/relay.js;8ff7cf21a1f72783afad72f6fbbd6b3eb2e9402c$113
[00:12:22] marktraceur ^^
[00:13:06] paladox: OK, that starts a new connection, but doesn't kill the old one AFAICT
[00:14:54] marktraceur, oh
[00:14:55] Would this
[00:14:56] }).on('end', function() {
[00:14:56] console.log('Client disconnected');
[00:14:57] });
[00:14:58] ?
[00:15:09] paladox: Also, it only works if the IRC client reconnects automatically, which I don't know enough to say it does
[00:15:21] paladox: No, that's not an order to disconnect, it's a listener for disconnections.
[00:15:28] oh
[00:15:40] also we have a new command we are testing that should restart the bot
[00:15:48] paladox: Let's talk about logical or
[00:15:51] !grrrit-wm-
[00:15:56] paladox: What does "a" || "b" evaluate to?
[00:16:06] nicknames
[00:16:16] What?
[00:20:56] paladox: Try running the expression "a" || "b" in a javascript console and tell me if the result is what you expected
[00:21:44] Oh i thought you were talking about when i did || for nicknames never mind
[00:21:44] anyways i think a || b is it allowing you to do a or b
[00:22:38] paladox: I'm not asking you what it does, I'm asking you to evaluate the expression and tell me what the result is
[00:23:02] Oh, i doint know how to do that
[00:23:54] paladox: Open a console, fire up nodejs, type in "a" || "b", and it should give you the result.
[00:24:16] oh
[00:24:20] ok thanks
[00:24:53] "a" || "b"
[00:24:59] > "a" || "b"
[00:24:59] 'a'
[00:24:59] >
[00:25:06] marktraceur ^^
[00:25:11] oh so a will win
[00:25:13] paladox: OK, you didn't need to paste all four lines.
[00:25:21] paladox: Yeah, that's what we call short-circuiting.
[00:25:25] sorru
[00:25:26] sorry
[00:25:34] paladox: I imagine you expected something like [ "a", "b" ]
[00:25:41] paladox: But that's an array.
[00:25:46] Oh yep
[00:25:50] paladox: Those square brackets are how we create arrays
[00:26:40] yep, so something like var whitelist = [ "paladox" || "mutante" || "Krenair" || "hashar" || "ostriches" || "greg-g" || "twentyafterfour" || "apergos" || "robh" ]; 
[00:26:42] will work?
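The short-circuiting marktraceur is describing can be seen in a few lines of plain JavaScript (runnable in node; the values are just illustrations):

```javascript
// || returns the first operand if it is truthy, otherwise the second.
// It does NOT build a list of alternatives.
const a = "a" || "b"; // "a" is truthy, so "b" is never even evaluated
console.log(a);       // prints: a

const fallback = "" || "b"; // "" is falsy, so the second operand wins
console.log(fallback);      // prints: b

// Square brackets with commas are how arrays are created:
const pair = ["a", "b"];
console.log(pair.length); // prints: 2
```

So `[ "paladox" || "mutante" || ... ]` would short-circuit to a one-element array `[ "paladox" ]`; a whitelist needs commas between the entries instead.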
[00:26:48] Oh my god no
[00:26:54] paladox: || is not how we define arrays
[00:27:04] oh sorry
[00:27:09] i forgot to remove that
[00:29:11] paladox: And just in case you thought you were done, when you've fixed that part, try running "a" === [ "a", "b" ] in your console and see what happens
[00:29:22] oh
[00:29:58] marktraceur it says false
[00:30:10] Yeah
[00:30:12] Of course it does
[00:30:26] paladox: Which means when you try to check from === whitelist, that won't work
[00:30:27] oh
[00:30:42] paladox: Luckily, you know how to use indexOf
[00:31:02] Well i got that from http://code.runnable.com/UkmYEow-67ktAAFM/irc-command-bot-with-node-irc-for-node-js
[00:31:19] oh now i get it
[00:31:48] 10Tool-Labs-tools-Pageviews: Add Mediaviews to Pageviews suite - https://phabricator.wikimedia.org/T149642#2758938 (10MusikAnimal)
[00:32:15] something like from === whitelist.indexOf
[00:32:21] marktraceur ^^
[00:32:52] paladox: OK, I stand corrected, you also copied your use of indexOf
[00:32:59] yep
[00:33:01] paladox: Why don't you spend a few minutes looking at your code
[00:33:10] paladox: See if you can figure out how indexOf works
[00:33:20] paladox: If not, then maybe try googling "javascript indexOf"
[00:33:26] Ok
[00:33:27] thanks
[00:33:30] paladox: Then once you think you have an answer, try running it in nodejs
[00:34:18] oh now i get it. from.indexOf === whitelist
[00:34:25] http://www.w3schools.com/jsref/jsref_indexof.asp
[00:34:30] paladox: That is not even slightly correct
[00:35:49] oh from.indexOf(whitelist) ?
[00:36:02] from.indexOf(whitelist) !== -1
[00:36:10] paladox: Have you tried running something like that in a JS console?
[00:36:11] marktraceur ^^
[00:36:36] http://stackoverflow.com/questions/1789945/how-to-check-if-one-string-contains-another-substring-in-javascript
[00:41:37] paladox: Have you tried running the code in a JS console?
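The membership check marktraceur is steering toward calls `indexOf` on the *array*, with the nick as the argument — the reverse of `from.indexOf(whitelist)`. A minimal sketch, using the variable names `whitelist` and `from` from the conversation (the names are illustrative; the actual grrrit-wm patch may differ):

```javascript
// A string is never === an array, so `from === whitelist` is always false.
console.log("a" === ["a", "b"]); // prints: false

// What the check actually wants: is `from` one of the whitelist entries?
// Array.prototype.indexOf returns the element's index, or -1 if absent.
var whitelist = ["paladox", "mutante", "Krenair"]; // example names
var from = "paladox"; // nick of the user issuing the command

console.log(whitelist.indexOf(from) !== -1);      // prints: true
console.log(whitelist.indexOf("nobody") !== -1);  // prints: false
```

`from.indexOf(whitelist)` is the wrong way around: that asks whether an array appears inside a string, which it never does.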
[00:46:20] marktraceur yes didnt work since i was using undefined code
[00:46:27] anyways i have to go sorry
[00:46:32] paladox: Yeah, so test something similar
[00:47:01] ok
[00:47:17] I'll be around later
[00:47:20] marktraceur: ill take a look i still have to write the backend for that lol
[00:47:30] Zppix: What backend?
[00:47:42] Functionality
[00:47:55] Keeping it up to date via db
[00:47:56] marktraceur grrrit-wm
[00:48:16] Zppix: OK, and you understand how arrays work in JS right
[00:48:29] I know a decent amount of js
[00:48:46] Zppix: I guess I can be hopeful, thin
[00:48:48] then.
[00:48:49] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[00:49:02] marktraceur: lol
[00:50:07] Zppix: I also have qualms about using nicks as "secure" identifiers, e.g., but that's not a JS concern, and I'm sure I'll have an opportunity to raise it at a later stage in the CR.
[00:50:36] I plan on using hostnames
[00:50:51] Excuse me cloaks
[00:51:29] Well, cloaks aren't universal
[00:53:58] I thought of that... You see i also plan on at some point making it so users have to "auth to the bot" to attempt to use any major high importance stuff
[00:54:47] OK, I look forward to seeing further improvements
[00:55:08] Me too
[01:23:53] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[02:55:50] !log deployment-prep Managed to mess up the deployment-puppetmaster02 cert, had to move those nodes back
[02:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[04:55:13] (03PS1) 10BryanDavis: Add COPYING file [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/319018
[04:55:16] (03PS1) 10BryanDavis: StewardBot: ratelimit @steward pings [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/319019
[05:07:04] 10Tool-Labs-tools-stewardbots: Evaluate cleanup on StewardBot's code - https://phabricator.wikimedia.org/T149404#2751215 (10bd808) Do you know what version of irclib is in use? I had to go back quite a way in the history of https://github.com/jaraco/irc even to see where the library was named irclib. I've been...
[06:04:27] PROBLEM - Puppet run on tools-worker-1014 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:39:25] RECOVERY - Puppet run on tools-worker-1014 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:28:27] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2759234 (10Paladox)
[07:28:36] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2755213 (10Paladox) 05stalled>03Open
[07:56:21] (03PS2) 10BryanDavis: StewardBot: ratelimit @steward pings [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/319019 (https://phabricator.wikimedia.org/T148110)
[07:58:43] (03CR) 10MarcoAurelio: "Does this support !steward as well?" [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/319019 (https://phabricator.wikimedia.org/T148110) (owner: 10BryanDavis)
[08:03:34] (03CR) 10MarcoAurelio: "SASL in SSL mode should be fine I think, but I have no idea how to do that." [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318229 (https://phabricator.wikimedia.org/T149265) (owner: 10Platonides)
[08:24:42] (03CR) 10MarcoAurelio: [C: 032] Introduce tox + flake8 [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318521 (https://phabricator.wikimedia.org/T128503) (owner: 10Hashar)
[08:25:27] (03Merged) 10jenkins-bot: Introduce tox + flake8 [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318521 (https://phabricator.wikimedia.org/T128503) (owner: 10Hashar)
[08:29:05] (03CR) 10MarcoAurelio: "check experimental" [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/319019 (https://phabricator.wikimedia.org/T148110) (owner: 10BryanDavis)
[08:31:28] (03CR) 10MarcoAurelio: "check experimental" [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/318229 (https://phabricator.wikimedia.org/T149265) (owner: 10Platonides)
[08:35:47] 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config, 13Patch-For-Review: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2759303 (10MarcoAurelio) Maybe we shouldn't be using the `mediawiki` queue. Is there a `labs` queue? (Sometimes the mediawiki queue...
[08:36:39] 10Tool-Labs-tools-stewardbots, 10Continuous-Integration-Config, 13Patch-For-Review: Implement jenkins tests on labs/tools/stewardbots - https://phabricator.wikimedia.org/T128503#2759304 (10MarcoAurelio) I've merged the above change. It always fails on tox, but I guess it's because the bot code is old.
[08:40:24] 10Tool-Labs-tools-stewardbots: Evaluate cleanup on StewardBot's code - https://phabricator.wikimedia.org/T149404#2759305 (10MarcoAurelio) >>! In T149404#2759123, @bd808 wrote: > Do you know what version of irclib is in use? I had to go back quite a way in the history of https://github.com/jaraco/irc even to see...
[09:52:48] Is the tools elastic search cluster just for general use? I couldn't see any documentation about it; I don't want to tread on anyone's toes but I'd like to do some elasticy stuff and if I could do that without having VMs on labs that would be cool.
[12:18:10] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2759556 (10Aklapper) @Paladox: Why did you re-add the #Labs team project?
[12:19:39] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2759569 (10Aklapper) (In general: Could people please be specific, avoid using only "this" and "it", actually be explicit what they're talking about, and take more time to phrase sentences that do not of...
[12:41:58] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/DatGuy was created, changed by DatGuy link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/DatGuy edit summary: Created page with "{{Tools Access Request |Justification=I am already a bot operator on the English Wikipedia. Soon, I might also have a bot that does continuous edits. I will need the tool labs..."
[12:48:39] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2759613 (10Paladox) Sorry, I didn't see the labs tag had already been added and then removed. Reason why I added the tag is because labs needs to create this new labs project so I can create the instance.
[12:53:12] 06Labs, 06Operations: cronspam from labstores, labcontrol, labstestservices - https://phabricator.wikimedia.org/T149574#2759635 (10MoritzMuehlenhoff) The "Cron /usr/bin/rsync --delete --delete-after -aSO /srv/glance//images/ labcontrol1002.wikimedia.org:/srv/glance//images/" message...
[13:27:32] !log tools reboot tools-exec-1404 post depool for test
[13:27:35] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:40:24] Is anyone around to help me figure out why I can't SSH to one of my instanceS?
[13:43:34] addshore: possibly, what: project / instance / user name / current behavior do you see
[13:43:48] chasemp: db.cognate.eqiad.wmflabs
[13:44:23] the scripts that conntect to it cant get to the mysql and I get channel 0: open failed: connect failed: No route to host when trying to ssh to it
[13:45:08] wild guess if whatever script doesn't know how to use the bastion to access
[13:45:28] by script I mean the wikis setup to read from the db (whoich are hosted on labs)
[13:45:35] you can't connect directly to that instance which would probably surface as no route to host
[13:45:37] it was working a few days ago
[13:45:59] my ssh config is setup to proxy through bastion, sshing to other instances works just fine
[13:46:09] yeah host seems unresponsive to me as well
[13:46:17] I can attempt a power cycle?
[13:46:22] possibly something there went awry
[13:49:11] chasemp: yeh, go for it!
[13:49:22] I have also tried that once in the last 10 mins!
[13:50:43] chasemp: looking at nagf it dies aprox 1 week ago
[13:55:37] Hello, I am looking for help with java/jdbc/tomcat on toollabs. Most important: what jdbc-driver do I have to select in the datasource object?
[13:56:21] gradzeichen: bd808 sent out a java tutorial recently to labs-l I would look for that and read through it first (I haven't had time to myself)
[13:57:13] ok, will do
[13:57:57] addshore: I can see that it's up and running but cannot get in still...can you ping andrewbogott to take a look when he is around?
[13:58:09] it did in fact reboot but still is dark so yeah, that's interesting
[13:59:07] addshore: is ssh allowed in your default security group?
[13:59:13] andrewbogott: yup
[14:01:38] chasemp: checked it, it does not speak about database access
[14:02:38] gradzeichen: I'm not sure, try asking bd808 when he's about it's still early there tho
[14:02:46] possibly a good addition to the existing help
[14:03:22] no rush regarding that instance, I just span up another one and got it all setup!
[14:03:36] so do with db.cognate.eqiad.wmflabs as you wish! :)
[14:03:56] addshore: I want to investigate a bit more; I'll delete it when I finish
[14:04:17] andrewbogott: I saw it reboot and saw it live on labvirt1001 but oddly unavail
[14:04:22] curious what you find :)
[14:04:33] I bet that grub is broken :(
[14:07:36] paladox: are you around?
[14:29:00] addshore: db is back, do you want anything there or should I just delete it now?
[14:30:19] chasemp: this was a side-effect of the kernel upgrades last week… I /think/ that this is one that failed the automatic upgrade process — I tuned grub by hand and set it to boot the precise kernel (which didn't exist since it was a trusty box)
[14:30:42] Fixed by mounting by hand as per https://wikitech.wikimedia.org/wiki/OpenStack#Mounting_an_instance.27s_disk and changing menu.lst
[14:33:33] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2755213 (10Krenair) New project requests don't just need to be in #Labs, they also need to block T76375 - but I'm not sure this qualifies for a project of it's own. Why not just an extra tool, or even a...
[14:34:49] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2759828 (10Zppix) >>! In T149529#2759181, @greg wrote: > Of what? Again, please use the names of things you want a test instance of. I'm still confused on what you need. You haven't listed anything yet o...
[14:44:18] 10Tool-Labs-tools-stewardbots: Evaluate cleanup on StewardBot's code - https://phabricator.wikimedia.org/T149404#2759864 (10bd808) >>! In T149404#2759305, @MarcoAurelio wrote: > Nope, sorry. I've been having a quick look through the files in WMF labs and I couldn't find anything. Maybe a shared library for all W...
[14:48:09] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/DatGuy was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=943577 edit summary:
[14:56:06] andrewbogott: ooooh I see!
[14:56:14] Yes, feel free to nuke it now ! :)
[14:58:10] (03PS3) 10BryanDavis: StewardBot: ratelimit @steward pings [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/319019 (https://phabricator.wikimedia.org/T148110)
[14:59:43] (03CR) 10BryanDavis: "> Does this support !steward as well?" [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/319019 (https://phabricator.wikimedia.org/T148110) (owner: 10BryanDavis)
[15:01:33] (03CR) 10BryanDavis: "> - tox-jessie https://integration.wikimedia.org/ci/job/tox-jessie/13041/console" [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/319019 (https://phabricator.wikimedia.org/T148110) (owner: 10BryanDavis)
[15:44:55] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2760101 (10Paladox) @Krenair oh, I guess we could have it on tools. But we need the ability to perminatly stop this test bot since it will just duplicate things if it is left running.
[15:59:07] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2760158 (10greg) I saw complaints last night of testing related to this making noise in our production Gerrit (and Phab?); what is your testing plan and how will you ensure that you are not disruptive in...
[16:02:47] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2760170 (10Paladox) @greg we could test on either an instance. Or duplicate grrrit-wm on the tools labs so that the production one is always working, and we can test using the test bot under a different...
[16:05:19] tarrow: In theory the elasticsearch cluster in tool labs is for anyone to use. In practice I'm the only one who has used it yet. If you open a phab task about your use case and I can create you a set of credentials that will give you write access to an index.
[16:07:41] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2760183 (10Paladox) Ive managed to create a instance on the git project. It is a small instance.
[16:08:29] bd808: can we add this to wikitech somewhere even if it's a "ask bd808?"
[16:08:37] I honestly didn't recall the state of things either :)
[16:11:04] chasemp: yeah. I should write that (and a zillion other things)
[16:11:10] heh
[16:11:19] now I know the answer is ask you
[16:11:45] I said I would document the search relevancy cluster like a month ago, and no movement
[16:11:54] gradzeichen: I don't think we have a mysql jdbc connector jar that is globally installed. I think that https://dev.mysql.com/downloads/connector/j/ would work for you
[16:17:39] bd808: can i install myself?
[16:18:03] or needs this to be installed globally to work with tomcat?
[16:18:45] gradzeichen: hmmm.. good question about how to integrate with the tomcat setup. I really haven't played with that.
[16:19:14] additional question: toollabs has java1.7 installed
[16:19:24] current version of java is 1.8
[16:19:48] if i compile locally with 1.8, it will not work on labs
[16:20:01] and i have to compile on labs
[16:20:11] gradzeichen: yes. we are stuck with 1.7 for the foreseeable future
[16:20:28] ok, but i really need jdbc to go on
[16:20:47] there are some phab tasks about upgrading with lots of problems that we haven't figured out how to solve.
[16:21:13] we may be able to do 1.8 in the kubernetes containers, but probably never on the OGE grid
[16:22:01] trying to shim this into SGE seems like a fools game yeah
[16:23:45] We are looking for alternatives to tomcat on kubernetes too
[16:23:54] some more light weight servlet container
[16:24:08] I have very little exp w/ tomcat actually
[16:24:34] way back when running a confluence/jira stack which was a nightmare and that's about it
[16:26:32] gradzeichen: so back to your question, yes you can download the jars and put them somewhere in your tool's $HOME. Then I guess we need to find out how to configure the classpath for tomcat to pick them up or find the right magic directory to put them in
[16:26:52] I think you should be abel to jsut put them in your war file somehow
[16:27:27] * bd808 is a bit hazy on how wars work anymore. it's been 10+ years since he was a full-time java developer
[16:29:09] i downloaded the jar and uploaded it to my account. i will try to put it in my war, but at the moment my ssh is frozen.
[16:47:18] bd808: Thanks, I've made a ticket here: https://phabricator.wikimedia.org/T149709
[16:48:33] 06Labs, 10Tool-Labs, 15User-bd808: Possible use of tools-lab-elasticsearch cluster - https://phabricator.wikimedia.org/T149709#2760359 (10bd808)
[16:50:00] tarrow: I can probably get some stuff started for this Wednesday/Thursday. It will give me a good reason to document what we have and the potential limitations of the setup.
[16:50:29] Thank you! That's great :)
[16:50:59] the big one is basically that our Elasticsearch is a shared environment much like our redis service. Everyone needs to play nice or things will go badly for everyone.
[16:52:42] (03CR) 10Paladox: "test" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: 10Paladox)
[16:52:42] (03CR) 10Paladox: "test" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: 10Paladox)
[16:55:31] sure; I would probably appreciate details on 'nice'. In the past I've only used ElasticSearch in an environment where I'm the only user. If you can let me know some rough guidelines to stick to that would be great.
[16:55:48] paladox: please put your testing bot in non-public channels. We don't need the noise. Maybe something like ##grrrit
[16:56:01] Ok sorry
[16:56:45] tarrow: I think we will have to figure it out as we go :) Mostly the concern I would have with the existing setup would be running things out of RAM with complex queries.
[16:57:13] ok :)
[16:58:56] 06Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Some versions of an image not rendering at all at wikitech - https://phabricator.wikimedia.org/T145811#2760437 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff Fixed, wikitech renders images again.
[17:03:56] (03Draft1) 10Paladox: testing bot [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/319106
[17:06:22] (03CR) 10Paladox: "recheck" [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/319106 (owner: 10Paladox)
[17:26:54] 06Labs, 06Operations, 13Patch-For-Review: Phase out the 'puppet' module with fire, make self hosted puppetmasters use the puppetmaster module - https://phabricator.wikimedia.org/T120159#2760585 (10yuvipanda) The issues with role::puppetmaster::standalone not being able to be its own client are fixed now! htt...
[17:32:08] 10Quarry, 10Analytics-Wikimetrics: Include Tulu Wikipedia in Metrics and Quarry - https://phabricator.wikimedia.org/T148950#2760605 (10Pavanaja)
[17:57:13] !log tools depool exec nodes on labvirt1002
[17:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:00:46] halfak: btw, let us know if snuggle no longer requires NFS :) will be happy to remove
[18:01:53] PROBLEM - Host tools-services-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.29)
[18:02:01] PROBLEM - Host tools-webgrid-generic-1403 is DOWN: CRITICAL - Host Unreachable (10.68.18.52)
[18:02:32] PROBLEM - Host tools-webgrid-lighttpd-1405 is DOWN: CRITICAL - Host Unreachable (10.68.17.65)
[18:02:34] PROBLEM - Host tools-exec-gift is DOWN: CRITICAL - Host Unreachable (10.68.16.40)
[18:02:50] PROBLEM - Host tools-redis-1001 is DOWN: CRITICAL - Host Unreachable (10.68.22.56)
[18:02:56] @quiet shinken-wm
[18:03:03] not sure if that's good enough?
[18:03:06] let's see
[18:03:49] PROBLEM - Host tools-exec-1203 is DOWN: CRITICAL - Host Unreachable (10.68.16.133)
[18:03:49] PROBLEM - Host tools-webgrid-lighttpd-1204 is DOWN: CRITICAL - Host Unreachable (10.68.18.49)
[18:03:51] :/
[18:04:21] PROBLEM - Host tools-webgrid-lighttpd-1401 is DOWN: CRITICAL - Host Unreachable (10.68.16.34)
[18:04:47] PROBLEM - Host tools-exec-1405 is DOWN: CRITICAL - Host Unreachable (10.68.18.3)
[18:05:09] PROBLEM - Host tools-k8s-etcd-03 is DOWN: CRITICAL - Host Unreachable (10.68.21.239)
[18:05:13] PROBLEM - Host tools-exec-1210 is DOWN: CRITICAL - Host Unreachable (10.68.17.147)
[18:05:21] @bd808: I compiled the jdbc-driver into my war-file and deployed. I does not work and I think it - conceptually - cannot, as the driver probably needs to be in the servers classpath, not in the classpath of the servlet?
[18:05:26] apparently not
[18:50:40] hi madhuvishy. question: there is a Tools maintenance on the way, right? and this is the one that may result in tools' unavailability for up to 48 hours?
[18:51:03] leila: I just sent an email! It was tomorrow, but we had to push it
[18:51:13] leila: hey! there is a general labs maintenance underway, but that's different from the one madhuvishy announced
[18:51:28] this should only have partial disruption on and off for a bit.
[18:51:37] leila: anything you want me to keep an eye on wlm-wise?
[18:51:49] yeah yuvipanda. thanks.
[18:51:58] madhuvishy: cool. /me goes to your email.
[18:53:06] \o/, madhuvishy. just in time. :D
[18:53:50] leila: :)
[18:54:38] yuvipanda: we may need your support if the maintenance stays at 11/14. that's during WLM international jury process, and we can't have Montage down for 48 hours.
[18:54:47] can you let me know how you can help, yuvipanda?
[18:55:03] leila: sure! I can work with you to make sure we have montage available uninterrupted in that time
[18:55:05] or even, madhuvishy: any chance that scheduled time-window be reconsidered?
[18:55:17] leila: I'm also fairly sure it won't take 48 hours
[18:55:22] yuvipanda: thank you. :)
[18:55:25] leila: it's easier for us to make montage be available rather than resched it
[18:55:33] I see, yuvipanda.
[18:55:47] leila: can you setup an email thread between me and the people doing montage now?
[18:55:53] leila: aah, it's the only slot available for us before I leave to India. chase is unavailable next week
[18:56:17] thanks yuvipanda!
[18:56:46] np madhuvishy
[18:59:45] gradzeichen: yeah. I think you may be right about that for Tomcat. I need to grab some lunch first but I can try to dig into the config files for how we have tomcat running on the job grid. I'm sure there is a way to make things work. It may even be that we already have some mysql jdbc driver jar in the path.
[19:02:54] bd808: is LDAP having problems?
[19:03:02] 2016-11-01T19:02:21.189156+00:00 oxygen nslcd[1065]: [334873] no available LDAP server found: Server is unavailable
[19:03:02] 2016-11-01T19:02:21.210945+00:00 oxygen diamond[536]: sudo: ldap_start_tls_s(): Can't contact LDAP server
[19:03:02] 2016-11-01T19:02:21.223114+00:00 oxygen diamond[536]: sudo: unable to resolve host oxygen
[19:03:29] and he didn't found chronium, hydrogen, acamar, and achernar .wikimedia.org
[19:04:02] so I can't login via ssh key at oxygen.rcm.eqiad.wmflabs
[19:07:45] can a labs-admin please take a look at that instance? I can't ssh into it
[19:07:54] I will be back in ~30-45 min after dinner
[19:10:27] !log tools depooled tools nodes from labvirt1004 and 1007
[19:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:18:30] yuvipanda hi it seems that grrrit-wm wont reconnect
[19:18:33] Sagan, will be a dns problem rather than an ldap problem
[19:18:43] if you can't connect to chromium, hydrogen, acamar and achernar
[19:18:51] when it said ping timeout, i ssh in to check, and when doing this
[19:18:52] kubectl get pods
[19:19:02] it shows
[19:19:02] tools.lolrrit-wm@tools-bastion-03:~/lolrrit-wm$ kubectl get pods
[19:19:02] Unable to connect to the server: dial tcp 10.68.17.142:6443: getsockopt: no route to host
[19:19:09] paladox: because there's ongoing labs maintenance, and the kubernetes master is just being restarted
[19:19:17] Oh
[19:19:18] ok
[19:20:03] yuvipanda, what about the network?
[19:20:22] non-tools instances shouldn't be seeing problems connecting to dns right?
[19:20:28] It seems i carn't ssh into a new instance i created
[19:20:47] Krenair: it should be fine, but my suspicion on Sagan's issue is probably an instance that hasn't run puppet in a long long time
[19:20:56] and has wrong ldap and puppetmaster addresses
[19:20:57] I managed to ssh in the first time, then it just stalled, then i closed my console and reopended it and tryed ssh in and it is not working
[19:21:07] $ ssh bot-gerrit
[19:21:07] channel 0: open failed: connect failed: No route to host
[19:21:07] stdio forwarding failed
[19:21:08] ssh_exchange_identification: Connection closed by remote host
[19:21:16] paladox: ongoing labs maintenance, the bastion hosts are also probably restarting just now
[19:21:18] so hang on :)
[19:21:23] Oh
[19:21:25] thanks
[19:22:02] Oh, but i can ssh into one instance but carn't into the new one.
[19:22:27] hmm might be something else. I can't really help right now tho, sorry
[19:22:31] Ok
[19:22:42] Krenair: but yeah, network in general should be untouched
[19:22:48] ok
[19:23:35] it seems to be stuck on rebooting
[19:26:34] 10Tool-Labs-tools-Other: create tool to crunch metrics for views (play started) of video and audio files - https://phabricator.wikimedia.org/T116363#2760981 (10harej-NIOSH) Going to close this task as complete since the metrics crunching is now underway; T149642 is the task for implementing the UI.
[19:26:35] 10Tool-Labs-tools-Other: create tool to crunch metrics for views (play started) of video and audio files - https://phabricator.wikimedia.org/T116363#2760981 (10harej-NIOSH) Going to close this task as complete since the metrics crunching is now underway; T149642 is the task for implementing the UI.
[19:26:47] 10Tool-Labs-tools-Other: create tool to crunch metrics for views (play started) of video and audio files - https://phabricator.wikimedia.org/T116363#2760983 (10harej-NIOSH) 05Open>03Resolved
[19:26:49] 10Tool-Labs-tools-Other: create tool to crunch metrics for views (play started) of video and audio files - https://phabricator.wikimedia.org/T116363#2760983 (10harej-NIOSH) 05Open>03Resolved
[19:28:05] 06Labs, 10grrrit-wm: Request creation of labs project - https://phabricator.wikimedia.org/T149733#2761000 (10Paladox)
[19:28:06] 06Labs, 10grrrit-wm: Request creation of labs project - https://phabricator.wikimedia.org/T149733#2761000 (10Paladox)
[19:28:17] 06Labs, 10grrrit-wm: Request creation of grrrit-wm-test labs project - https://phabricator.wikimedia.org/T149733#2761014 (10Paladox)
[19:28:19] 06Labs, 10grrrit-wm: Request creation of grrrit-wm-test labs project - https://phabricator.wikimedia.org/T149733#2761014 (10Paladox)
[19:30:23] 06Labs, 10grrrit-wm: Request creation of grrrit-wm-test labs project - https://phabricator.wikimedia.org/T149733#2761000 (10yuvipanda) This should just be another tool rather than a labs project I think.
[19:30:29] 06Labs, 10grrrit-wm: Request creation of grrrit-wm-test labs project - https://phabricator.wikimedia.org/T149733#2761000 (10yuvipanda) This should just be another tool rather than a labs project I think.
[19:31:03] 06Labs, 10grrrit-wm: Request creation of grrrit-wm-test labs project - https://phabricator.wikimedia.org/T149733#2761018 (10Paladox) Oh then tool please?
[19:31:07] 06Labs, 10grrrit-wm: Request creation of grrrit-wm-test labs project - https://phabricator.wikimedia.org/T149733#2761018 (10Paladox) Oh then tool please?
[19:31:18] Why did it do ^^ that twice
[19:31:54] now is not really a good time to try to do this paladox, it could be caught in a reboot during the op
[19:31:57] I would wait this maint out
[19:32:07] chasemp that wasent me
[19:32:13] i doint do wikibugs
[19:34:29] I'll deal with wikibugs once the maintenance is done
[19:34:39] !log tools migrate tools-elastic-03 to labvirt1009
[19:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[19:40:51] Howdy. We just deployed our beta Wikipedia Library Card platform. It was working well but now after announcing new signups we are getting 502 errors on all pages. I contacted our developer, but can anyone else see what's going on here?
[19:40:52] http://twl-test.wmflabs.org/
[19:43:53] Ocaasi: we are currently in a large maintenance period which while it should be relatively short comes with maximum volatility
[19:44:03] ah, good to know!
[19:44:07] a message to labs-announce should come at the end
[19:52:00] !log tools move tools-elastic-03 to labvirt1010, -02 already in 09
[19:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[20:00:51] is someone free to take a look at my instance issue?
[20:01:40] Sagan: check out status in the topic, maint in progress labs wise
[20:01:42] wide even
[20:02:54] 06Labs, 10grrrit-wm: Request creation of grrrit-wm-test labs project - https://phabricator.wikimedia.org/T149733#2761118 (10Paladox) 05Open>03declined
[20:02:55] 06Labs, 07Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2761119 (10Paladox)
[20:02:56] 06Labs, 10grrrit-wm: Request creation of grrrit-wm-test labs project - https://phabricator.wikimedia.org/T149733#2761118 (10Paladox) 05Open>03declined
[20:03:02] 06Labs, 07Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2761119 (10Paladox)
[20:03:16] Sagan: can you file a bug on phabricator?
[20:03:31] I looked at it a tiny bit, and you've a really strange resolv.conf and I've no idea how that happened [20:04:29] yuvipanda: ok [20:05:05] Sagan: try now [20:05:31] yuvipanda: nice, that works [20:05:39] still need to file a bug? :) [20:05:42] what did you do? [20:05:45] Sagan: can you still file a bug so I can track this? [20:05:51] ok [20:07:05] (03PS5) 10Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) [20:08:27] !log tools depooled things on labvirt1006 and 1008 [20:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:08:46] 06Labs, 10Labs-Infrastructure, 10Labs-project-other: Can't ssh to oxygen.rcm.eqiad.wmflabs - https://phabricator.wikimedia.org/T149737#2761146 (10Luke081515) [20:08:48] 06Labs, 10Labs-Infrastructure, 10Labs-project-other: Can't ssh to oxygen.rcm.eqiad.wmflabs - https://phabricator.wikimedia.org/T149737#2761146 (10Luke081515) [20:09:08] yuvipanda: ^ [20:09:14] 10Labs-project-Wikistats: new possible wikifarms / hives detected - check for lists - https://phabricator.wikimedia.org/T38570#2761161 (10RobiH) Loads of new data here: https://wikiapiary.com/ https://wikiapiary.com/wiki/Websites https://wikiapiary.com/wiki/Farm:Farms Maybe even potential to join forces with t... [20:09:15] 10Labs-project-Wikistats: new possible wikifarms / hives detected - check for lists - https://phabricator.wikimedia.org/T38570#2761161 (10RobiH) Loads of new data here: https://wikiapiary.com/ https://wikiapiary.com/wiki/Websites https://wikiapiary.com/wiki/Farm:Farms Maybe even potential to join forces with t...
[20:09:56] 06Labs, 10Labs-Infrastructure, 10Labs-project-other: Can't ssh to oxygen.rcm.eqiad.wmflabs - https://phabricator.wikimedia.org/T149737#2761166 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Somehow the instance's /etc/resolv.conf got to: ``` root@oxygen:~# cat /etc/resolv.conf domain rcm. search rcm.... [20:09:58] 06Labs, 10Labs-Infrastructure, 10Labs-project-other: Can't ssh to oxygen.rcm.eqiad.wmflabs - https://phabricator.wikimedia.org/T149737#2761166 (10yuvipanda) 05Open>03Resolved a:03yuvipanda Somehow the instance's /etc/resolv.conf got to: ``` root@oxygen:~# cat /etc/resolv.conf domain rcm. search rcm.... [20:12:37] !log tools.stashbot Test message after the elasticsearch vms were rearranged to live on separate physical hosts [20:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL [20:32:16] !log tools depool tools things on labvirt1005 and 1009 [20:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [20:34:33] Is Tool Labs going down for a reboot? [21:19:24] yuvipanda: ok, so which name should I use - wdqs-puppet.eqiad.wmflabs or wdqs-puppet.wikidata-query.eqiad.wmflabs [21:19:34] latter [21:19:44] Error: Could not retrieve catalog from remote server: Server hostname 'wdqs-puppet.eqiad.wmflabs' did not match server certificate; expected one of wdqs-puppet.wikidata-query.eqiad.wmflabs, DNS:puppet, DNS:puppet.wikidata-query.eqiad.wmflabs, DNS:wdqs-puppet.wikidata-query.eqiad.wmflabs [21:19:52] Could https://petscan.wmflabs.org/ be restarted? It throws 502... [21:19:55] yup, you need the latter [21:20:01] Is this a planned outage? [21:20:03] yuvipanda: that's what I used [21:20:21] first or just now? [21:20:29] since the error you pasted suggests otherwise? [21:20:36] yuvipanda: right now it's configured for it [21:20:37] Urbanecm: Probably, they're doing rolling reboots. [21:20:40] but it doesn't work [21:20:54] SMalyshev: i see. this is wdqs-puppet?
[21:20:58] yuvipanda: yes [21:21:03] ok, I'll take a look now [21:21:07] yuvipanda: thanks! [21:21:16] Matthew_: I was informed about a planned outage which was announced at 18:00 UTC but this time passed. Could it be this? [21:21:35] Urbanecm: it started at that time, but is ongoing [21:21:39] Urbanecm: Maybe. My instances just came up. [21:21:50] Like, 2 minutes ago. So it appears to be still ongoing. [21:22:00] yuvipanda: And when will it end? [21:22:04] Matthew_: [21:22:07] ^ [21:22:18] Urbanecm: we're done with 10 of 13 [21:22:22] I don't know, yuvipanda would have a better answer. [21:22:28] From how many? [21:22:41] Sorry, I've misread [21:22:49] of isn't the same as or :D [21:23:17] at most another hour [21:23:52] SMalyshev: I just hand fixed the master to point to 'wdqs-puppet.wikidata-query.eqiad.wmflabs' and it seems to work fine? [21:24:04] can you change that in hiera to the same value? [21:24:20] hiera now is: puppetmaster: wdqs-puppet.wikidata-query.eqiad.wmflabs [21:24:23] yuvipanda, I recreated deployment-puppetmaster02 and ran into different issues :/ [21:24:35] SMalyshev: ok. the node lgtm [21:24:44] yuvipanda: yes, works for me too, thanks! [21:24:49] SMalyshev: \o/ cool [21:24:53] Krenair: uh oh. [21:25:52] yuvipanda, I tried making it its own puppetmaster... but it has an ssl error [21:26:08] Krenair: I see. I'll look in a minute [21:26:19] ty [21:28:12] yuvipanda: so how can I manually change the puppetmaster on a host? because I changed hiera for another host, but it is not being picked up because I think it still tries to use the old one [21:28:14] Krenair: I think step 2.3 was missed in https://wikitech.wikimedia.org/wiki/Standalone_puppetmaster#Step_2:_Setup_a_puppet_client [21:28:23] Krenair: works for me now [21:28:29] SMalyshev: you can edit /etc/puppet/puppet.conf [21:28:37] ah, ok, thanks!
[21:28:45] SMalyshev: following 2 in https://wikitech.wikimedia.org/wiki/Standalone_puppetmaster#Step_2:_Setup_a_puppet_client should work [21:28:50] ah [21:28:55] ty [21:29:07] yuvipanda: nope, it still says: Error: /File[/var/lib/puppet/lib]: Could not evaluate: Connection refused - connect(2) Could not retrieve file metadata for puppet://wdqs-puppetmaster.wikidata-query.eqiad.wmflabs/plugins: Connection refused - connect(2) [21:29:13] uses old hostname [21:29:27] SMalyshev: there might be two entries in /etc/puppet/puppet.conf [21:29:34] (this is one of the things role::puppet::self did) [21:29:45] SMalyshev: you might have to change 'em both [21:31:23] autosigning doesn't seem to work... :( [21:32:38] SMalyshev: yeah, we discovered that yesterday. [21:32:41] I hope to have a fix for that today or tomorrow [21:34:05] !log tools depool tools nodes on labvirt1012 [21:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:49:00] 10Labs-project-Wikistats: new possible wikifarms / hives detected - check for lists - https://phabricator.wikimedia.org/T38570#2761612 (10Dzahn) @Robih I heard not that long ago that Wikiapiary was not maintained anymore, did somebody take that over? [21:50:26] mutante ^^ markahashberg has taken over wikiapiary [21:50:31] can't spell his name [21:51:08] mutante https://en.wikipedia.org/wiki/User:MarkAHershberger [21:51:43] 10Labs-project-Wikistats: new possible wikifarms / hives detected - check for lists - https://phabricator.wikimedia.org/T38570#416608 (10Paladox) @Dzahn wikiapiary is now maintained by @MarkAHershberger https://en.wikipedia.org/wiki/User:MarkAHershberger [21:52:03] 06Labs, 10Quarry: 502 Bad Gateway on HTTP-requests to quarry.wmflabs.org - https://phabricator.wikimedia.org/T121502#1880330 (10MuhammadShuaib) Again we face the same issue. [21:52:26] paladox: oh, hexmode !
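[editor's note] The puppet.conf fix discussed above can be sketched as follows. The hostnames are taken from the conversation; the duplicate `server =` entries are the artifact of role::puppet::self mentioned in the log, the exact file layout is an assumption, and GNU sed's in-place `-i` is assumed. The sketch works on a copy in /tmp so it is safe to dry-run anywhere:

```shell
# Simulate a puppet.conf left behind by role::puppet::self, with the
# (hypothetical) old master name appearing in two sections.
cat > /tmp/puppet.conf.example <<'EOF'
[main]
server = wdqs-puppet.eqiad.wmflabs

[agent]
server = wdqs-puppet.eqiad.wmflabs
EOF

# Point every "server =" line at the FQDN the master's certificate
# actually carries -- both entries must be changed, as noted above.
sed -i 's/^server = .*/server = wdqs-puppet.wikidata-query.eqiad.wmflabs/' /tmp/puppet.conf.example

grep -n 'server =' /tmp/puppet.conf.example
```

On a real host the file would be /etc/puppet/puppet.conf, edited as root, followed by a fresh `puppet agent --test` run.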
[21:52:45] paladox: that's cool [21:52:47] 06Labs, 10Quarry: 502 Bad Gateway on HTTP-requests to quarry.wmflabs.org - https://phabricator.wikimedia.org/T121502#2761620 (10MuhammadShuaib) 05Resolved>03Open [21:53:40] 06Labs, 10Quarry: 502 Bad Gateway on HTTP-requests to quarry.wmflabs.org - https://phabricator.wikimedia.org/T121502#1880330 (10Matthewrbowker) >>! In T121502#2761617, @MuhammadShuaib wrote: > Again we face the same issue. Appears to be related to https://lists.wikimedia.org/pipermail/labs-announce/2016-Nove... [21:54:59] !log tools stop gridengine-master on tools-grid-master in preparation for reboot [21:55:02] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [21:58:40] yuvipanda, autosigning is enabled but I moved a node over and got: [21:58:46] Info: Certificate Request fingerprint (SHA256): AA:52:00:49:A0:BA:31:98:B8:01:55:21:29:EE:30:36:E4:8A:61:40:31:39:7A:57:5A:E0:E7:D8:A5:DD:D0:71 [21:58:46] Info: Caching certificate for ca [21:58:46] Exiting; no certificate found and waitforcert is disabled [21:58:47] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [21:59:11] do I need to restart the puppetmaster for that? [21:59:28] Krenair: good question. 
[21:59:39] Krenair: I think you might have to restart puppetmaster at least once after turning on autosigning [21:59:43] ah [21:59:49] puppet doesn't handle that I suppose [21:59:54] I'm not sure tho - that should've happened automatically I think [22:00:38] !log deployment-prep started moving nodes back to the new puppetmaster [22:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL [22:01:21] 10Labs-project-Wikistats: new possible wikifarms / hives detected - check for lists - https://phabricator.wikimedia.org/T38570#2761633 (10Dzahn) Thanks Paladox, that's good news :) [22:08:42] 06Labs, 10Quarry: 502 Bad Gateway on HTTP-requests to quarry.wmflabs.org - https://phabricator.wikimedia.org/T121502#2761641 (10Matthewrbowker) 05Open>03Resolved Works for me now. Is related to my previous comment. [22:13:26] I'm getting "Danger: There was an error submitting the form. Please try again." when I try to apply the vagrant lxc role to an instance I just launched as a replacement for one that stopped responding [22:13:54] Is this something I should just wait until things stabilize before attempting again? [22:19:46] !log deployment-prep stopped and started udp2log-mw on -fluorine02 [22:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL [22:21:15] !log deployment-prep started mysql on -db04 [22:21:20] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL [22:21:52] okay [22:21:56] well it's back in read-only [22:22:10] bearloga: try in a couple mins? [22:22:14] !log deployment-prep started mysql on -db03 to hopefully pull us out of read-only mode [22:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL [22:23:31] andrewbogott: Killing grid nodes with running jobs on them? :-( [22:23:53] ? 
[22:23:55] multichill: I usually drain them before killing them [22:24:40] I just noticed one of the jobs I fired up before I left was killed with; " Broadcast message from root@tools-exec-1402" [22:24:51] " The system is going down for power off NOW!" [22:25:04] Don't you guys know by now how to operate a grid? [22:25:13] clearly not. [22:25:54] multichill: we announced that we're going to do this last week. [22:26:09] yuvipanda: works now :) [22:26:15] bearloga: cool! [22:26:32] You're operating a grid. Why is my job scheduled on a node that is going down in less than an hour? [22:26:37] yuvipanda: idk if/what you did but thank you :) [22:26:48] multichill: this attitude isn't going to help you. [22:27:20] multichill: please tone it down, and I'm going to ignore you until you're asking a bit more productive questions. Thanks. [22:27:25] puppet syntax errors will cause 'Could not find class' right? [22:27:38] Krenair: could, yeah. [22:27:40] (unrelated to my puppetmaster move, this just happened to come up) [22:27:51] so there is Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find class role::cache::text for deployment-cache-text04.deployment-prep.eqiad.wmflabs on node deployment-cache-text04.deployment-prep.eqiad.wmflabs [22:27:53] Krenair: depending on where and what context it's happening in. [22:27:59] this is using the old puppetmaster [22:28:08] which has: modules/role/manifests/cache/text.pp:class role::cache::text( [22:28:27] 'Current branch production is up to date.' [22:29:46] yuvipanda: Sorry man. Just sad. Is this by design or just an oops? [22:30:19] multichill: I think unfortunately it's by design. 
we tried doing that a few times before (where we'd first stop new jobs from running, then drain restartables, wait for nonrestartables to finish) [22:30:40] multichill: however, that took multiple days, and we also have no clear way of figuring out things that really aren't restartable vs things running from cron :( [22:30:42] it takes MUCH longer to do things that way, and we're only barely fitting this in in one day as it is :( [22:31:22] In a previous job I used to be involved in grids slightly bigger (1000 nodes) and draining properly was a sport [22:32:40] multichill: I think if we figured out the cron stuff better we'd be able to do it too. plus a lot of jobs just run forever, and we've no way of knowing [22:32:47] so things would have to abruptly die at some point anyway [22:32:55] and this allows us to get it over with as quickly as possible [22:33:25] 06Labs, 10Labs-Infrastructure, 10Labs-project-other: Can't ssh into xenon.rcm.eqiad.wmflabs - https://phabricator.wikimedia.org/T149750#2761673 (10Luke081515) [22:33:58] Yeah, you need expected runtime to be able to schedule properly during drain [22:34:03] yeah [22:34:04] and we don't have that [22:34:13] which means we can't really plan [22:34:35] yuvipanda, know anything about this on shinken-01? [22:34:39] The last Puppet run was at Tue Nov 1 18:03:20 UTC 2016 (263 minutes ago). Puppet is disabled. reason not specified [22:34:43] Does jsub support it now? I haven't really checked. I had some jobs that went nowhere and a time limit would have prevented some lost cycles there [22:34:47] multichill: but I understand the frustration, and apologize. [22:34:57] multichill: yeah, I think it does, but not really sure what the option is [22:36:50] multichill: also, we're rebooting tools-login (tools-bastion-03) again, just sent a message...
[22:38:20] As long as you're staying clear from tools-exec-1412, I'm good ;-) [22:39:54] multichill: even otherwise, we'd need to force everyone to specify it if we want to be able to plan this properly [22:39:55] and that's really the biggest problem. [22:39:57] multichill: that is also restarting [22:40:09] (a bunch of nodes didn't pick up kernel upgrades when we rebooted the labvirts) [22:40:39] yuvipanda: How do I submit my job now in a way that it doesn't get killed again? [22:40:41] multichill: this is a bit of a snafu tho, we'll try to automate this next time so we don't repool before checking kernels on the instances [22:40:59] multichill: wait for maybe 10mins, will probably get an 'all clear, we are done with maint', and then go ahead? [22:41:10] @bd808: Any news on the tomcat config yet? [22:41:12] sleep 3600; jsub..... [22:41:26] multichill: also with the k8s backend for jsub (which I hope to start working on soon), we'll probably have other options [22:41:38] what I want is a 'restart this job if the node goes away, and run it until the job errors or completes' [22:41:46] which I think would've made your situation less painful [22:41:54] so you wouldn't have to manually resubmit [22:42:09] In this case I'm creating tons of new items on Wikidata [22:42:10] !log tools.admin Accidentally deleted the main webservice job to tools.wmflabs.org. Should be back up now. [22:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL [22:42:24] Sometimes the database has lag and I end up creating tons of dupes if I'm not careful [22:42:25] Heh [22:42:34] multichill: right [22:42:40] yuvipanda: bastion-02 is all clear? [22:42:48] multichill: yeah [22:43:01] gradzeichen: gah. Haven't had time to look yet. Give me 10 minutes. [22:43:03] Sleep 3600 it is :-) [22:43:14] * bd808 is horribly forgetful today [22:44:20] yuvipanda: Dirty cow? 
[22:44:33] multichill: yeah [22:44:43] this is already far later than we'd have liked [22:45:04] multichill: but we also wanted to do a bunch of other upgrades that also needed restarts, so.. [22:45:15] !log tools.admin Restarted toolhistory job on trusty and fixed bigbrother entry to launch it there in the future [22:45:17] It wasn't rated that high anyway [22:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL [22:46:35] multichill: ok, all clear now [22:48:19] Urbanecm: I see petscan is still down [22:48:26] it probably didn't come back from the restart properly [22:48:38] Urbanecm: can you file a bug, and I'll look into it now? [22:48:57] bd808: btw, bigbrother is spamming about something admin [22:49:13] arg [22:49:18] yuvipanda, bloody puppet [22:49:23] 'could not find class'?! [22:49:34] Turns out, it was an extra 0xC2 which it didn't like in the class file! [22:49:38] aaaaaaaa [22:49:40] - # topic into�many JSON based kafka topics for further [22:49:40] + # topic into many JSON based kafka topics for further [22:49:40] lol [22:49:42] also not lol [22:49:49] yuvipanda: Thanks man, you're lucky I already filled out the survey, hehe ;-) (did you get a lot of submissions here?) [22:49:56] multichill: :D [22:50:06] multichill: I am not sure - bd808 is doing it this year [22:50:19] so what else needs fixing... [22:50:32] right, yuvipanda know anything about the shinken-01 puppet disabling? [22:50:39] !log tools.admin bigbrother sucks. use `jstart` instead of `jsub -once -continuous` [22:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.admin/SAL [22:50:44] Krenair: yeah, that was me. I should've left a message :| [22:51:01] Krenair: feel free to re-enable. it was to kill ircecho and shut up shinken-wm [22:51:12] indeed.
[22:51:30] I'll get to that after I've committed and uploaded this puppet patch [22:52:06] Urbanecm: it looks like magnus was running petscan inside a screen, and that died with the reboot [22:52:16] Urbanecm: I don't know how to bring it back, so will have to ping magnus [22:52:38] yuvipanda: I think I fixed the bb spam but it looks like it queued a ton of them [22:52:47] bd808: yeah [22:52:48] not surprised [22:53:07] like multiple per minute! [22:53:51] bd808: :D [22:54:32] yuvipanda: I just submitted job 9992829, party at 10000000? :P [22:54:38] multichill: :D [22:54:49] :appropriate-emoji: [22:54:53] !log shinken enabled puppet, shinken is back [22:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Shinken/SAL [22:55:03] ty Krenair [22:55:45] !log petscan start petscan inside a screen for magnus [22:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Petscan/SAL [22:56:24] Urbanecm: I've tried to start it, and it looks like it might work [22:56:31] yuvipanda, https://gerrit.wikimedia.org/r/#/c/319243/ [22:56:32] multichill: I know it rolls over at some point, not sure when [22:56:34] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [22:57:02] Krenair: can you get a +1 from 5 different people with at least 3 different hair colors? [22:57:02] PROBLEM - Puppet run on tools-puppetmaster-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [22:57:12] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 0.94 ms [22:57:14] I modified a comment [22:57:20] ty [22:57:20] RECOVERY - Host tools-worker-1005 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms [22:58:04] PROBLEM - SSH on tools-docker-builder-02 is CRITICAL: Connection refused [22:58:43] I think I just replaced c2 a0 with 20 [22:59:21] what was next...
right, deployment-mira ssh via keyholder getting denied [22:59:31] andrewbogott: can you bump tools-docker-02? It doesn't seem to be back up [22:59:45] gradzeichen: Ok, I found the startup script that gets run for the tomcat backend -- https://phabricator.wikimedia.org/diffusion/OSTW/browse/master/scripts/deprecated-tomcat-starter [23:00:20] Krenair: want any help in moving other nodes? or do you wanna do them one by one? [23:01:26] nah [23:01:29] I can deal with that [23:01:33] This, however: [23:01:37] Nov 1 23:00:22 deployment-mediawiki04 sshd[9136]: Failed publickey for mwdeploy from 10.68.20.135 port 60658 ssh2: RSA cb:43:c3:76:3a:68:b0:29:1a:8f:3f:31:1f:87:f8:7c [23:01:43] Nov 1 23:00:38 deployment-mediawiki04 sshd[9144]: Accepted publickey for mwdeploy from 10.68.21.205 port 56836 ssh2: RSA cb:43:c3:76:3a:68:b0:29:1a:8f:3f:31:1f:87:f8:7c [23:01:48] yuvipanda: tools-docker-02? In what project? [23:01:49] THEY'RE THE SAME KEY [23:01:58] I don't see anything with that name in tools [23:02:02] andrewbogott: tools-docker-builder-02 [23:02:08] sorry, too many words [23:02:18] Krenair: ouch [23:02:27] Krenair: check if there's a warning about rdns above that maybe? [23:02:36] I ran into something like that with clush [23:02:36] not that I see [23:02:49] there is this though [23:02:50] Nov 1 23:00:22 deployment-mediawiki04 sshd[9136]: pam_access(sshd:account): access denied for user `mwdeploy' from `deployment-mira.deployment-prep.eqiad.wmflabs' [23:02:54] correct hostname [23:02:55] ah right [23:03:09] maybe a security/access.conf thing? [23:03:19] no PTR weirdness on that IP [23:03:30] ok [23:03:39] possibly access.conf [23:03:44] yuvipanda: do you know if the tomcat backend even works on our trusty webgrid nodes? [23:04:04] bd808: the realistic answer is 'no'. I know that it worked when I last touched it. 
[23:04:07] ugh [23:04:08] +:ALL:LOCAL [23:04:08] + : mwdeploy : 10.68.21.205 [23:04:09] -:ALL EXCEPT (project-deployment-prep) root:ALL [23:04:10] great [23:04:34] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22) [23:05:09] Yeah this is local puppet stuff [23:05:15] Can't believe no one else has ever noticed it though [23:06:34] PROBLEM - Host tools-docker-builder-02 is DOWN: CRITICAL - Host Unreachable (10.68.23.105) [23:08:08] RECOVERY - Host tools-docker-builder-02 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [23:08:25] yuvipanda: that host says [23:08:28] Welcome to emergGive root password for maintenance [23:08:28] (or type Control-D to continue): [23:08:53] so it's hosed? [23:08:59] yuvipanda: I'm trying to wrap up so I can go to dinner, but if that's user-facing I can dig deeper... [23:09:14] Definitely hosed, may be possible to rescue with an offline mount [23:09:18] andrewbogott: it's just me and bd808 facing, so it can wait for tomorrow [23:09:21] that's not one of the ones I did a special reboot for just now is it? [23:09:31] andrewbogott: we can try a minimal rescue, and if that doesn't work I can rebuilkd [23:09:33] *rebuild [23:10:05] ok [23:10:11] remind me tomorrow to dig deeper [23:12:31] andrewbogott: ok! [23:20:42] yuvipanda: andrewbogott: Sorry for being a bit grumpy before [23:21:25] (03PS6) 10Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) [23:24:13] multichill: no worries! 
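[editor's note] For context on the `pam_access(sshd:account): access denied` failure above: /etc/security/access.conf is read top to bottom and the first matching rule wins, so the deploy hosts' IPs have to appear in a `+` rule before the catch-all `-` rule. A sketch of the fixed file, with the IPs copied from the conversation; the surrounding rules mirror the pasted diff and are otherwise assumptions:

```
# /etc/security/access.conf -- format is  permission : users : origins
# First matching rule wins, so order matters.
+ : ALL : LOCAL
+ : mwdeploy : 10.68.21.205 10.68.20.135
- : ALL EXCEPT (project-deployment-prep) root : ALL
```

Leaving a deploy host's IP out of the `+` line makes mwdeploy fall through to the final `-` rule, which is exactly the "same key, different result" symptom seen in the log.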
understandable, and thank you for understanding our plight too :) [23:25:21] Okay, success [23:28:17] yuvipanda, https://gerrit.wikimedia.org/r/319249 [23:28:39] these are beta-only classes [23:29:12] seem to be duplicates but I'm not solving that problem right now [23:29:22] it behaves as expected [23:30:50] Krenair: ok [23:30:59] Krenair: I assume deployment_hosts is hand maintained in hiera and has the right IPs? [23:31:20] it does have the right IPs [23:31:27] -+ : mwdeploy : 10.68.21.205 [23:31:27] ++ : mwdeploy : 10.68.21.205 10.68.20.135 [23:31:33] the extra one is correct [23:31:40] it's actually maintained in the puppet repo though [23:31:47] in modules/network/manifests/constants.pp [23:32:00] Krenair: I merged it [23:32:05] ah ok [23:32:06] nice [23:32:35] So, my next thing is [23:32:43] Think I'm going back to moving puppetmasters [23:33:01] * yuvipanda nods [23:34:27] @bd808: so, if you add mysql-connector-java-5.1.40-bin.jar to $CLASSPATH then it might work? [23:35:45] gradzeichen: I think I may have figured it out. $CATALINA_HOME points to /data/project/$tool/public_tomcat. 
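[editor's note] The jar placement being worked out here can be sketched as below. Tomcat 7's common class loader picks up jars from $CATALINA_HOME/lib, making them visible to both the container and every webapp, whereas a jar inside the war's WEB-INF/lib is private to that app. Paths here are throwaway stand-ins so the sketch is safe to run; on Tool Labs the real directory would be /data/project/<tool>/public_tomcat:

```shell
# Stand-in CATALINA_HOME; substitute the tool's real public_tomcat directory.
CATALINA_HOME=/tmp/demo-public_tomcat
mkdir -p "$CATALINA_HOME/lib"

# Stand-in for the real driver jar named in the conversation.
touch /tmp/mysql-connector-java-5.1.40-bin.jar

# Drop the driver into the common class loader's search path.
cp /tmp/mysql-connector-java-5.1.40-bin.jar "$CATALINA_HOME/lib/"

ls "$CATALINA_HOME/lib"
```

After copying the jar, the tomcat process has to be restarted for the common loader to pick it up.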
According to https://tomcat.apache.org/tomcat-7.0-doc/class-loader-howto.html it should magically load jar files found in $CATALINA_HOME/lib [23:36:04] Krenair: I'm going afk to turn on some laundry, I'll be back in 5-10min [23:36:13] yuvipanda, k, thanks [23:36:37] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [23:36:48] autosigning appears to work \o/ [23:37:00] gradzeichen: it should also work to put the mysql jar in /WEB-INF/lib/ in your jar file, but that will only let your app code see it and not the outer tomcat process [23:40:17] Hi it seems that login.tools.wmflabs.org is really slow [23:40:32] it is slow to ssh in and i have an 80mbps connection [23:43:29] oh, nope [23:43:30] it doesn't [23:44:37] gradzeichen: I added some notes at https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Web#Java_.28Tomcat.29 -- If you get things to actually work it might be nice to review and expand those docs. [23:46:50] Krenair: am back [23:47:02] https://docs.puppet.com/puppet/3.8/reference/ssl_autosign.html talks about having a whitelist file [23:47:43] Krenair: hmm, it's the same thing we use for the labs master [23:48:16] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218) [23:49:24] Krenair: I see [23:49:25] > autosign = true [23:49:40] Krenair: which should work [23:49:52] Krenair: although, I guess we should at least do the whitelist domains [23:53:32] Krenair: given our environment, I think I see no difference between all 3 autosigning methods [23:53:35] they're all equivalent to naive [23:53:36] yuvipanda, see deployment-cache-text04 [23:53:43] ok looking [23:54:56] krenair@deployment-puppetmaster02:~$ sudo puppet cert list [23:54:56] Warning: Setting templatedir is deprecated.
See http://links.puppetlabs.com/env-settings-deprecations [23:54:56] (at /usr/lib/ruby/vendor_ruby/puppet/settings.rb:1139:in `issue_deprecation_warning') [23:54:57] "deployment-cache-text04.deployment-prep.eqiad.wmflabs" (SHA256) 26:FB:13:42:5E:E4:70:7E:66:55:5A:73:76:EE:DF:35:B7:4D:A0:5C:41:FC:33:38:7B:DD:57:D8:7C:1A:26:8A [23:54:57] D8: Add basic .arclint that will handle pep8 and pylint checks - https://phabricator.wikimedia.org/D8 [23:55:55] right [23:56:38] * legoktm pets stashbot [23:57:18] Krenair: am going to compare the config with labcontrol1001 [23:57:53] Krenair: ok, there's no material difference
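[editor's note] On the autosigning discussion above: in Puppet 3.8 the master's `autosign` setting accepts `true` (naive, sign every request), `false`, or a path to a whitelist file containing one certname or glob per line. The whitelist variant the log leans toward would look roughly like this; the conventional file path is shown, and the `*.eqiad.wmflabs` glob is an assumption based on the labs hostnames in this log:

```ini
# /etc/puppet/puppet.conf on the puppetmaster -- instead of "autosign = true",
# point at a whitelist file:
[master]
autosign = /etc/puppet/autosign.conf
```

/etc/puppet/autosign.conf would then contain e.g. `*.eqiad.wmflabs`, one glob per line. As noted in the log, when every host that can reach the master already matches the glob, this is effectively equivalent to naive autosigning.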