[00:00:22] Krenair: ok, something else going on in the master
[00:00:50] Krenair: am looking at certmanager and trying to understand what it's doing
[00:01:20] or at least where it's doing
[00:04:05] @bd808: i created public_tomcat/lib, added the mysql-jar to it, restarted tomcat, but it does not show up in the classpath: //tools.wmflabs.org/isbn-tmptest/IsbnCheckAndFormat?debug
[00:04:25] it isn't running on labcontrol
[00:13:32] gradzeichen: "com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure" -- that looks like the jar is loading but some other config needs to be adjusted.
[00:19:08] Does grid support all python 2?
[00:20:40] Zppix: can you rephrase your question? Yes we have python2 available on both the OGE grid (jsub) and the kubernetes cluster (webstart)
[00:21:02] Sorry i meant for jsub
[00:21:45] The part that makes me wonder is "all python 2". I'm not sure what you are asking
[00:24:33] bd808: Hey
[00:24:45] hey d3r1ck
[00:25:04] What's up bd808? I liked the survey you sent out.
[00:26:03] thanks. Mostly I just changed the dates and yuvipanda's name to mine from the version that they created and sent last year. :)
[00:27:01] * bd808 goes looking for dinner
[00:31:05] bd808: python 2.27 etc
[00:32:26] (CR) Paladox: "test" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/319106 (owner: Paladox)
[00:38:46] Wikitech and horizon 2FA isn't working for me.
[00:40:46] PROBLEM - Puppet run on tools-puppetmaster-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[00:41:52] what does it do tom29739?
[00:43:01] Krenair: on horizon it says "Invalid credentials."
[00:44:44] And on wikitech it says "Failed to validate two-factor credentials" when I try to turn 2FA off. I'm currently logged into wikitech.
[00:54:19] huh. andrewbogott: ^
[00:54:46] yuvipanda, did you figure out what is up with autosigning?
[00:55:09] Krenair: nope, I couldn't track it down.
I realized andrewbogott probably did some work around this a while ago, so I'll wait for him to come back
[00:55:16] ok
[00:55:58] Labs, Beta-Cluster-Infrastructure: Move deployment-prep to role::puppetmaster::standalone - https://phabricator.wikimedia.org/T149620#2762053 (AlexMonk-WMF)
[01:07:36] Zppix: on the trusty servers, python is 2.7.6. On the jessie kubernetes servers it's 2.7.9.
[01:39:11] Labs, Labs-Kubernetes, Tool-Labs: Switch default backend for tools webservices to be kubernetes (Tracking) - https://phabricator.wikimedia.org/T149762#2762071 (yuvipanda)
[02:01:40] (PS7) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[02:01:56] (PS8) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[02:03:54] (CR) Paladox: "@Legoktm this should now restart the ssh connection whilst the bot stays connected to irc :)." [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[02:17:35] (CR) Paladox: [C: 1] "test comment." [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[02:17:38] (CR) Paladox: [C: 1] "test comment."
[labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[02:17:56] (CR) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[02:17:58] (CR) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[02:21:00] (PS9) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[02:21:03] (PS9) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[02:27:11] Labs, Wikimedia-Site-requests, wikitech.wikimedia.org, Patch-For-Review: Wikitech: Switch over from using extension SemanticForms to PageForms - https://phabricator.wikimedia.org/T149749#2762125 (Dereckson)
[02:52:42] (CR) Dzahn: "this looks much better to me now since it does not kill the bot and IRC connection anymore but just the ssh connection to gerrit, which is" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[06:36:40] PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[07:11:40] RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:28:36] @bd808: I tried to change my config, but still get the CommunicationsException. I have no idea why, please take a look at //tools.wmflabs.org/isbn-tmptest/IsbnCheckAndFormat?debug (user test, pw heute) for a full listing of the exception, the connection gets refused???
[07:43:41] gradzeichen: has this just been happening today?
[07:44:35] if so, [Labs-l] [Labs-announce] Many instance reboots Tuesday, 2016-11-01 at 18:00 UTC
[07:44:45] I have db issues also
[07:46:09] brought down my entire project basically. :(
[07:46:16] oh well, tomorrow
[07:47:37] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[07:53:55] @AmandaNP: today and yesterday. but this is a java issue.
[07:54:59] ahh ok. if it's a coding thing, do you want me to take a look at it or is it over complicated?
[07:57:48] AmandaNP: please try it, look in the url i gave
[07:59:50] Labs, Labs-Infrastructure, DBA, MassMessage, and 3 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2762301 (Marostegui) Interesting update on the percona bug: ``` The stack trace shows TokuDB waiting for a row lock. There is a pull request...
[08:14:00] gradzeichen: so 1) do you have a publicly viewable code 2) what are you trying to connect to?
[08:26:28] AmandaNP: Code at the moment only in my account, I am trying to connect to "jdbc:mysql://127.0.0.1:3306/s51370__isbntool", tried with s51370 and s53182
[08:28:18] so inside your tool. has it worked before 16 hours ago?
[08:40:11] AmandaNP: It's my first try to access a db on labs by jdbc/tomcat
[08:41:07] The sql code will only try to do a simple select. but it fails on connecting
[08:46:50] and you've provided your username and password for the database access right?
[08:46:53] in your code?
[08:57:34] yes, it is at the moment hardcoded.
But: I get a connection refused, so it is probably even before user and pw are sent, or else there should be some unauthorized message
[09:02:58] Hi when i try to become lolrrit-wm i get this error
[09:03:00] paladox@tools-bastion-03:~$ become lolrrit-wm
[09:03:00] -bash: fork: retry: No child processes
[09:03:00] c-bash: fork: retry: No child processes -bash: fork: retry: No child processes
[09:04:08] works now
[09:05:04] gradzeichen: so the question is can you mysql in on the instance?
[09:07:19] like when you've ssh'd in
[09:08:29] if the answer is no, then tell me what service mysql status says
[09:11:45] Labs, Labs-Kubernetes, Tool-Labs: Switch default backend for tools webservices to be kubernetes (Tracking) - https://phabricator.wikimedia.org/T149762#2762071 (zhuyifei1999) Will legacy lighttpd + cgi-based tools still be supported? (Should I create a task for this or there's already one?)
[09:12:24] I can access sql on the command line, but then get access denied, when selecting the database (it belongs to project isbn, i use project isbn-tmptest)
[09:13:09] I thought, that everyone had read access to the database (
[09:13:33] and if not, hoped for a more descriptive error message?)
[09:18:43] oh you're on tool labs...I don't know much about tool labs dbs. I'm speaking more from your own instance view. let me see if I can find some docs for you
[09:21:29] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database might be of help
[09:22:20] it sounds like it's labs specific. if you want total freedom of what you can do, your own instance might be the way to go. But that also comes at the cost of you maintaining it
[09:22:27] sorry i'm not more help
[09:22:40] Labs, Labs-Infrastructure, DBA, MassMessage, and 3 others: mysqld process hang in db1069 - S2 mysql instance - https://phabricator.wikimedia.org/T145077#2762576 (jcrespo) That should not be it, because replication changes are commited in order, or in a single thread.
However, if the issue happens...
[10:06:50] (PS10) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[10:06:53] (PS10) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[10:07:46] (CR) Paladox: "> this looks much better to me now since it does not kill the bot and" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[10:07:49] (CR) Paladox: "> this looks much better to me now since it does not kill the bot and" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[10:23:49] (PS11) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[10:26:14] (CR) Paladox: "test" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/319106 (owner: Paladox)
[10:26:41] (CR) Paladox: "test" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/319106 (owner: Paladox)
[10:34:11] (PS12) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[11:06:12] (PS13) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[11:34:41] (CR) Paladox: "test" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/319106 (owner: Paladox)
[11:42:04] (PS14) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976
(https://phabricator.wikimedia.org/T149609)
[11:44:53] (CR) Zppix: "Whitelist.js is a current mockup will be working on code ASAP" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[12:02:27] (CR) Paladox: "test" [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/319106 (owner: Paladox)
[12:15:54] (PS15) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[13:07:28] Why is task@tools-exec-1410.eqiad.wmflabs "temporarily not available"? (Background: I want my continuous jobs to automatically re-exec themselves when I push a new code revision. But NFS delays can mean that a job gets the re-exec signal through Redis before the exec node gets the updated code, so my solution was to schedule a task for each exec node running my code to first watch for the update and then signal the jobs on that exec node. But
[13:07:28] e.g. job 9994060 from yesterday is stuck pending because the queue is unavailable to run it.)
[13:40:23] anomie: taken offline for maint/testing
[13:40:54] chasemp: It still seems to be getting continuous@ jobs though?
[13:42:00] seems so but that shouldn't be
[13:42:16] it seems like the depool and the mass updates and reboots yesterday had a bad interaction
[13:42:38] depooling again
[13:42:50] !log depool tools-exec-1404 for maint
[13:42:51] Unknown project "depool"
[13:42:56] @bd808: Still no solution for tomcat/jdbc connection.
The tool now loads its configut
[13:42:58] !log tools depool tools-exec-1404 for maint
[13:43:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[13:43:37] anomie: interesting thing really, I wasn't aware our maint depools may not persist after reboot
[13:43:42] will have to check that out in itself
[13:44:13] ration from /data/project/isbn-tmptest/public_tomcat/webapps/isbn-tmptest/isbn.properties I have provided two test sets in isbntest.properties and isbnprod.properties the result can be seen in
[13:44:45] https://tools.wmflabs.org/isbn-tmptest/IsbnCheckAndFormat?debug (user test pw heute)
[13:45:24] the error is: Caused by: java.net.ConnectException: Connection refused
[13:45:52] I can only think, it has something to do with rights on toollabs??
[16:09:07] if I do ssh-add -l then I see my private key
[16:09:49] Drop the -p 29418 too
[16:09:53] This isn't gerrit
[16:10:02] yeah, i copypasted the wrong thing
[16:10:32] is it not ssh -i ~/.ssh/id_rsa *myshellaccountname* (don't know if supposed to be private)@login.tools.wmflabs.org
[16:10:55] I think you can do it with way, too, yes
[16:12:02] so, it says my passwords should be "publickey,keyboard-interactive,hostbased"
[16:13:09] great, now it works for no reason
[16:13:11] argh
[16:13:15] sorry for wasting your time
[16:14:40] andrewbogott: around?
[16:21:23] AmandaNP: what's up?
[16:22:02] re. the email I sent you about utrs, that's what I thought till I ran df and sudo'd the service start and got:
[16:22:09] https://www.irccloud.com/pastebin/ZsLbutSf/
[16:22:27] Labs, Wikimedia-Site-requests, wikitech.wikimedia.org, Patch-For-Review: Wikitech: Switch over from using extension SemanticForms to PageForms - https://phabricator.wikimedia.org/T149749#2763900 (Dereckson) @Andrew Could you comment on this issue?
[16:22:57] nothing seems full.
and ya now that i'm awake I realize /mnt isn't on the network
[16:23:58] I'll log in and poke around, but if the system boots and you can log in then whatever's happening is surely unrelated to the maintenance yesterday
[16:24:46] k, well then definitely unrelated
[16:25:45] AmandaNP: I think it's maybe a permission error, which means it can't write, which is causing an (incorrect) disk full error?
[16:27:16] like you're thinking on mnt? (cause the permission only fails on - or says it only fails - when non-sudo'd)
[16:27:24] * AmandaNP checks files in mnt
[16:29:20] everything there belongs to mysql except -rw-rw---- 1 root root 6 Sep 14 06:28 mysql_upgrade_info
[16:31:04] the last time I know the tool was running was 17:20 UTC, 1 November 2016
[16:33:37] hm, might be that apparmor doesn't like having mysql in that dir
[16:36:33] AmandaNP: try adjusting /etc/apparmor.d/usr.sbin.mysqld to allow for the location of your databases?
[16:39:28] it already has it I think? even /mnt/mysql r and /mnt/mysql/** rwk
[16:39:49] along with several other mysql ting
[16:39:52] *things
[16:41:57] yeah, seems like
[16:45:56] AmandaNP: for some reason mkdir /var/lib/mysql-files seems to have helped
[16:46:38] lovely...
[16:46:49] * AmandaNP wonders what changed that...
[16:47:29] yep thank you. everything is back up
[16:47:32] interesting
[16:49:06] labsdb replica lag should now start going down again
[17:10:30] Hi, labs-admin online?
[17:13:49] what do you need Sagan?
[17:14:05] I created https://phabricator.wikimedia.org/T149750 yesterday
[17:14:11] and I still can't access that instance :/
[17:16:10] Sagan, you can't get to instances by running 'ssh' on a bastion
[17:16:26] it works for 3 other instances
[17:16:35] as the bastion won't have your SSH keys
[17:16:36] Krenair: ^
[17:16:36] (CR) MarkTraceur: [C: -1] "If you're going to persist in writing WIP code, please mark it as such, so the rest of us don't spend time reading totally useless changes" (13 comments) [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[17:17:00] meh, I actually use agent forwarding. I use that priv key only for labs
[17:17:03] Sagan, what, did you enable agent forwarding or something?
[17:17:06] You should disable that
[17:19:17] Krenair: I will take a look, how to do it the alternative way at putty. But anyway, my access is broken :/
[17:42:28] !log tools drain nodes from labvirt1012 and 13
[17:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[17:56:00] Labs, Labs-Infrastructure, DBA, Operations: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2764229 (jcrespo)
[17:56:03] Labs, Labs-Infrastructure, DBA, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2764226 (jcrespo) Open>Resolved a:jcrespo I am assuming this as resolved because I think it is done and saw no complain it is not working fr...
[18:02:50] PROBLEM - Host tools-logs-02 is DOWN: CRITICAL - Host Unreachable (10.68.22.34)
[18:03:40] PROBLEM - Host tools-worker-1010 is DOWN: CRITICAL - Host Unreachable (10.68.20.94)
[18:03:50] PROBLEM - Host tools-exec-1418 is DOWN: CRITICAL - Host Unreachable (10.68.23.142)
[18:03:56] PROBLEM - Host tools-k8s-etcd-01 is DOWN: PING CRITICAL - Packet loss = 100%
[18:03:58] yuvipanda: you have time to look at my instnace?
[18:04:02] *instance
[18:04:16] PROBLEM - Host tools-worker-1003 is DOWN: CRITICAL - Host Unreachable (10.68.23.240)
[18:04:18] PROBLEM - Host tools-flannel-etcd-03 is DOWN: CRITICAL - Host Unreachable (10.68.22.169)
[18:04:24] PROBLEM - Host tools-static-11 is DOWN: CRITICAL - Host Unreachable (10.68.17.208)
[18:04:32] PROBLEM - Host tools-exec-1411 is DOWN: CRITICAL - Host Unreachable (10.68.17.209)
[18:04:47] tell me all about it, shinken
[18:04:57] PROBLEM - Host tools-worker-1012 is DOWN: CRITICAL - Host Unreachable (10.68.22.74)
[18:05:05] PROBLEM - Host tools-worker-1013 is DOWN: CRITICAL - Host Unreachable (10.68.21.243)
[18:05:06] andrewbogott: or do you have time?
[18:05:09] PROBLEM - Host tools-prometheus-01 is DOWN: CRITICAL - Host Unreachable (10.68.21.162)
[18:05:11] PROBLEM - Host tools-worker-1007 is DOWN: CRITICAL - Host Unreachable (10.68.23.53)
[18:05:23] PROBLEM - Host tools-worker-1002 is DOWN: CRITICAL - Host Unreachable (10.68.22.43)
[18:05:34] Sagan: not for a few hours, sorry
[18:05:39] ok
[18:06:08] PROBLEM - Host tools-worker-1021 is DOWN: CRITICAL - Host Unreachable (10.68.22.153)
[18:06:10] PROBLEM - Host tools-worker-1006 is DOWN: CRITICAL - Host Unreachable (10.68.17.89)
[18:06:12] PROBLEM - Host tools-worker-1016 is DOWN: CRITICAL - Host Unreachable (10.68.21.253)
[18:10:41] Labs, Tool-Labs: Unconfirm account email addresses for Wikitech accounts that bounced during 2016 survey mailings - https://phabricator.wikimedia.org/T149824#2764282 (bd808)
[18:11:55] RECOVERY - Host tools-exec-1418 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[18:12:44] RECOVERY - Host tools-k8s-etcd-01 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[18:12:44] RECOVERY - Host tools-logs-02 is UP: PING OK - Packet loss = 0%, RTA = 1.40 ms
[18:13:08] RECOVERY - Host tools-static-11 is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[18:13:26] RECOVERY - Host tools-flannel-etcd-03 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[18:13:28] RECOVERY - Host tools-prometheus-01 is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms
[18:13:40] RECOVERY - Host tools-worker-1010 is UP: PING OK - Packet loss = 0%, RTA = 2.38 ms
[18:13:46] RECOVERY - Host tools-worker-1007 is UP: PING OK - Packet loss = 0%, RTA = 4.68 ms
[18:13:47] RECOVERY - Host tools-exec-1411 is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms
[18:14:00] RECOVERY - Host tools-worker-1002 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms
[18:14:07] RECOVERY - Host tools-worker-1021 is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms
[18:14:25] RECOVERY - Host tools-worker-1003 is UP: PING OK - Packet loss = 0%, RTA = 1.76 ms
[18:14:27] RECOVERY - Host tools-worker-1016 is
UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[18:14:59] RECOVERY - Host tools-worker-1013 is UP: PING OK - Packet loss = 0%, RTA = 3.78 ms
[18:15:09] RECOVERY - Host tools-worker-1012 is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms
[18:15:17] RECOVERY - Host tools-worker-1006 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms
[18:19:49] RECOVERY - Puppet run on tools-worker-1013 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:23:13] !log tools manually stop tools-grid-master for reboot
[18:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL
[18:24:22] PROBLEM - Host tools-webgrid-lighttpd-1415 is DOWN: PING CRITICAL - Packet loss = 100%
[18:24:26] PROBLEM - Host tools-worker-1008 is DOWN: CRITICAL - Host Unreachable (10.68.21.13)
[18:24:54] PROBLEM - Host tools-worker-1001 is DOWN: CRITICAL - Host Unreachable (10.68.23.55)
[18:25:04] PROBLEM - Host tools-worker-1019 is DOWN: CRITICAL - Host Unreachable (10.68.18.25)
[18:25:06] PROBLEM - Host tools-docker-builder-01 is DOWN: CRITICAL - Host Unreachable (10.68.19.180)
[18:25:26] PROBLEM - Host tools-exec-1415 is DOWN: CRITICAL - Host Unreachable (10.68.20.251)
[18:25:45] PROBLEM - Host tools-grid-master is DOWN: CRITICAL - Host Unreachable (10.68.20.158)
[18:25:59] PROBLEM - Host tools-worker-1020 is DOWN: CRITICAL - Host Unreachable (10.68.17.223)
[18:26:07] PROBLEM - Host tools-worker-1009 is DOWN: CRITICAL - Host Unreachable (10.68.18.26)
[18:26:07] PROBLEM - Host tools-worker-1022 is DOWN: CRITICAL - Host Unreachable (10.68.21.130)
[18:26:12] @bd808: It works now. I have also added infos to Help:Tool Labs/Java, Help:Tool Labs/Database, and Help:Tool Labs/Web.
[18:26:29] gradzeichen: \o/
[18:26:37] thanks for updating docs!
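Note on the JDBC thread that closes here: the log does not record what the fix was. The reasoning given earlier in the discussion was that a "Connection refused" happens before any user name or password is sent, so it points at the host and port rather than at the credentials. A minimal sketch of that check, assuming Python is available on the host; `classify_connect_error` and `probe_tcp` are hypothetical helper names, not part of any Tool Labs tooling:

```python
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a failed TCP connect to a coarse label."""
    if isinstance(exc, ConnectionRefusedError):
        # Nothing is listening on that host:port; credentials were never sent.
        return "refused"
    if isinstance(exc, socket.timeout):
        # No answer at all, e.g. packets silently dropped.
        return "timeout"
    # DNS failures, unreachable networks, and anything else.
    return "error"


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a plain TCP connect and report 'open' or a failure label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify_connect_error(exc)


if __name__ == "__main__":
    # Host and port are taken from the JDBC URL quoted in the log.
    print(probe_tcp("127.0.0.1", 3306))
```

If the probe reports "refused", changing the user or password in the properties file cannot help; the host part of the JDBC URL is the thing to look at.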
[18:26:43] PROBLEM - Host tools-prometheus-02 is DOWN: CRITICAL - Host Unreachable (10.68.17.221)
[18:26:58] PROBLEM - Host tools-exec-1420 is DOWN: CRITICAL - Host Unreachable (10.68.21.42)
[18:27:00] PROBLEM - Host tools-k8s-etcd-02 is DOWN: CRITICAL - Host Unreachable (10.68.18.64)
[18:27:10] PROBLEM - Host tools-exec-1412 is DOWN: CRITICAL - Host Unreachable (10.68.23.154)
[18:27:54] !log bots Manually restarted wm-bot as it didn't come back after reboot.
[18:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Bots/SAL
[18:31:19] Labs, Labs-Infrastructure, DBA, Operations: Create maintain-views user for labsdb1001 and labsdb1003 - https://phabricator.wikimedia.org/T148560#2764404 (jcrespo) Actually there was one issue with entry point (using the socket instead of the hostname), and another one with a check which required...
[18:34:27] RECOVERY - Host tools-exec-1412 is UP: PING OK - Packet loss = 0%, RTA = 2.31 ms
[18:35:04] RECOVERY - Host tools-exec-1415 is UP: PING OK - Packet loss = 0%, RTA = 2.61 ms
[18:35:04] RECOVERY - Host tools-webgrid-lighttpd-1415 is UP: PING OK - Packet loss = 0%, RTA = 0.57 ms
[18:35:16] RECOVERY - Host tools-prometheus-02 is UP: PING OK - Packet loss = 0%, RTA = 2.04 ms
[18:35:20] RECOVERY - Host tools-exec-1420 is UP: PING OK - Packet loss = 0%, RTA = 1.85 ms
[18:35:48] RECOVERY - Host tools-worker-1019 is UP: PING OK - Packet loss = 0%, RTA = 4.34 ms
[18:36:00] RECOVERY - Host tools-worker-1020 is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[18:36:04] RECOVERY - Host tools-worker-1009 is UP: PING OK - Packet loss = 0%, RTA = 1.58 ms
[18:36:10] RECOVERY - Host tools-worker-1022 is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms
[18:36:16] RECOVERY - Host tools-worker-1001 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[18:36:45] RECOVERY - Host tools-grid-master is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[18:37:15] RECOVERY - Host tools-k8s-etcd-02 is UP: PING OK - Packet loss = 0%,
RTA = 0.51 ms
[18:37:35] RECOVERY - Host tools-docker-builder-01 is UP: PING OK - Packet loss = 0%, RTA = 1.07 ms
[18:37:59] RECOVERY - Host tools-worker-1008 is UP: PING OK - Packet loss = 0%, RTA = 2.09 ms
[18:42:02] PROBLEM - Puppet run on tools-exec-1415 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0]
[18:43:06] PROBLEM - Puppet run on tools-grid-master is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0]
[18:47:00] RECOVERY - Puppet run on tools-exec-1415 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:50:06] !log deployment-prep started mysql on -db boxes to bring beta back online
[18:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[18:51:21] !log deployment-prep armed keyholder on -tin and -mira
[18:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[18:56:24] Labs, Labs-Infrastructure, Labs-project-other: Can't ssh into xenon.rcm.eqiad.wmflabs - https://phabricator.wikimedia.org/T149750#2764499 (Luke081515) After some reboots, I can login now. Can somebody look why this happens? Everytime I reboot, this happens. Then I need to reboot a few times more to l...
[18:58:06] RECOVERY - Puppet run on tools-grid-master is OK: OK: Less than 1.00% above the threshold [0.0]
[19:00:58] Labs, Labs-Infrastructure, DBA, Operations, and 2 others: Move dbproxy1010 and dbproxy1011 to labs-support network, rename them to labsdbproxy1001 and labsdbproxy1002 - https://phabricator.wikimedia.org/T149170#2764508 (jcrespo) Labs and DBAs agree this should go on. @robh @Cmjohnson Is this som...
[19:01:05] mostly solved, I can access now
[19:37:19] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:56:47] Labs, Labs-Infrastructure, DBA, Operations: Prepare and check production and labs-side filtering for olowiki - https://phabricator.wikimedia.org/T147302#2764842 (chasemp) Open>Resolved olowiki_p exists
[20:02:51] Labs, Labs-Infrastructure, DBA, Operations: Enable access to Wikipedia Tulu (tcywiki) on labs replicas - https://phabricator.wikimedia.org/T142223#2764883 (chasemp) Open>Resolved a:chasemp _p view variant should be good to go
[20:02:54] Labs, Wikimedia-Labs-General, DBA, Operations, Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2764887 (chasemp)
[20:03:15] Labs, Wikimedia-Labs-General, DBA, Operations, Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#530760 (chasemp)
[20:03:18] Labs, Tool-Labs, DBA, Operations: Replicate wikimania2017wiki to labs - https://phabricator.wikimedia.org/T126096#2764890 (chasemp) Open>Resolved a:chasemp _p view variant should be good to go
[20:14:11] hullo all, 502 at http://tools.wmflabs.org/nppdash/patrollerinfo/ - have tried restarting the web service, nothing in qstat
[20:14:32] have been trying stopping/starting on both gridengine + kubernetes
[20:25:12] (CR) Paladox: "@MarkTraceur" (1 comment) [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[20:25:19] (CR) Zppix: "@MarkTraceur" (1 comment) [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[20:28:44] (CR) MarkTraceur: Adds a grrrit-wm restarting command for you to type in irc (2 comments) [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976
(https://phabricator.wikimedia.org/T149609) (owner: Paladox)
[20:30:07] (PS16) Paladox: Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[20:34:30] Labs, Labs-Infrastructure, DBA, Availability: Decide between proxysql and haproxy for labsdbproxy service - https://phabricator.wikimedia.org/T149844#2766141 (jcrespo)
[20:35:10] was the maintenance to labs finished? I get something weird going on. I can connect to labs from secondary bastion, but not from bast2001.wikimedia.org
[20:35:26] and to terbium it's the reverse - connects from bast2001.wikimedia.org but not from secondary
[20:35:36] before it worked with bast2001.wikimedia.org for all of them
[20:36:43] SMalyshev: thought that was postponed?
[20:37:19] 'til the 14th? https://lists.wikimedia.org/pipermail/labs-announce/2016-November/000177.html (if it's that outage/maintenance you're talking about - NFS?)
[20:40:39] doesn't look like nfs-related
[20:40:47] more like reboots-related
[20:46:14] hi, is anyone around to help me with xsltproc on the kubernetes image for recitation-bot? the ticket for installing it has been closed but it still doesn't run for me.
[20:48:04] (PS17) Paladox: [WIP] Adds a grrrit-wm restarting command for you to type in irc [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/318976 (https://phabricator.wikimedia.org/T149609)
[20:48:09] dfko, yuvipanda would probably be the one to talk to?
[20:48:13] not sure if he's around
[21:04:20] ok thanks
[21:05:13] hey dk
[21:05:15] err
[21:05:18] hey dfko
[21:05:19] I installed libxml2
[21:05:21] yo
[21:05:31] since that ticket was for lxml2 right
[21:05:39] I see it on labs but not from inside kubernetes
[21:05:58] the webservice running on the kubernetes infrastructure needs to call it
[21:06:53] dfko: is https://phabricator.wikimedia.org/T140117 what you're talking about?
[21:07:14] and if you're shelling out to xsltproc, why?
[21:08:53] I'm inheriting the decision to shell out to xsltproc. I can change to using the python wrapper for the library, but that incurs an additional delay
[21:08:58] PROBLEM - Host tools-docker-builder-02 is DOWN: CRITICAL - Host Unreachable (10.68.23.105)
[21:10:43] dfko: I am not sure if we'll support shelling out to arbitrary commands inside the kubernetes container as is now. we're making a 'trusty-legacy' container that has all the software we currently have in the grid, and that might be useful - but we're trying to not install all the things in all the containers
[21:11:11] so I'm going to say your options are either to switch to the python versions of it, or wait for the trusty container (which I hope to have happen this week or next week). I recommend the former
[21:11:43] ok, I can make the change then. Is there some way planned for users to customize their own containers? It would be nice to take advantage of that possibility of containerization.
[21:13:41] dfko: at some point we'll have to evaluate any of the PaaS things that run on kubernetes, and they might allow you to do that.
[21:13:46] no eta on when tho :(
[21:13:48] sorry!
[21:15:13] ok, looking forward to it then!
[21:16:06] dfko: thanks for the understanding :) and do let me know if you need more libraries installed
[21:33:23] RECOVERY - Host tools-secgroup-test-102 is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms
[21:37:46] Can someone reset my wikitech 2FA please
[21:38:02] PROBLEM - Host tools-secgroup-test-102 is DOWN: CRITICAL - Host Unreachable (10.68.21.170)
[21:38:10] Labs, Labs-Kubernetes, Tool-Labs, Community-Tech-Tool-Labs: Develop evaluation criteria for comparing Platform as a Service (PaaS) solutions - https://phabricator.wikimedia.org/T136265#2766442 (bd808) Several users of the current kubernetes solution have asked for the ability to use custom Docker...
[21:39:31] yuvipanda hi, is there a way i can get the pods from a terminated pod
[21:39:42] since i want to view a log for a terminated pod
[21:39:58] to debug why grrrit-wm keeps crashing when someone creates a certain patch
[21:40:00] please?
[21:41:01] tom29739: per https://wikitech.wikimedia.org/wiki/Password_reset#Reset_two_factor_authentication -- can you make a request as a text file in your $HOME on one of the bastions and then tell me where to look for it?
[21:41:40] paladox: hey! grrrit-wm also writes its logs to the filesystem I think? that might be the place to look
[21:41:51] Oh
[21:41:58] we don't really do any log collection, so there's not an easy way to see the logs of dead pods yet
[21:42:29] oh
[21:47:00] bd808, it's in a file called `2fa_reset` in my home directory on bastion-01
[21:50:04] tom29739: what's your shell user name?
[21:50:17] bd808, `tom29739`
[21:56:42] tom29739: blerg. I don't have sudo on bastion-01. yuvipanda or madhuvishy can you quickly verify that there is a /home/tom29739/2fa_reset file on bastion-01
[21:57:00] what why do you not have sudo?!
[21:57:02] that seems weird
[21:57:07] but sure , I'll check
[21:57:23] I don't have it Labs wide I don't think. Only in tool labs
[21:57:52] * bd808 is bad at collecting hats
[21:58:37] bd808: you do have it labswide, your key is in root no?
[21:58:51] oh, I can't ssh into bastion-01 ofc :D [21:58:58] it disallows roots from logging in as themselves [21:59:01] * yuvipanda sshs in as root [21:59:24] bd808: have verified [21:59:40] yuvipanda: thx [22:00:29] just reposting the question from earlier - 502 at http://tools.wmflabs.org/nppdash/patrollerinfo/ - have tried restarting the web service, nothing in qstat and have been trying stopping/starting on both gridengine + kubernetes [22:01:44] hey myrcx [22:01:47] let me look [22:01:59] yuvipanda: you're a star, thank you :) [22:02:38] myrcx: so I did 'kubectl get pods' [22:02:45] it showed me there's a pod in 'CrashLoopBackoff' [22:02:53] (kubectl get pod is like qstat) [22:02:57] then I did [22:03:01] 'kubectl logs ' [22:03:04] and it gave me logs [22:03:09] 2016-11-02 22:00:19: (log.c.118) opening errorlog '/data/project/nppdash/error.log' failed: Permission denied [22:03:18] this is why it wasn't showing up in error.log [22:03:26] myrcx: I see [22:03:27] > -rw-r--r-- 1 samtar tools.nppdash 0 Oct 6 19:47 error.log [22:03:39] myrcx: if you do 'take error.log' and then restart webservice that should work [22:03:44] :D [22:04:11] yuvipanda: oh my goodness [22:04:50] well thank you! I'll bear kubectl get pods in mind, only knew qstat :P [22:05:03] myrcx: :D is new [22:06:54] yuvipanda, what user does it try to open the log file under? [22:07:30] yuvipanda: and now I can see the errors in error.log so that's helpful :) [22:10:41] tom29739: the container should run as the tool user, so tools.nppdash in this case [22:11:55] myrcx: yw [22:12:04] tom29739: the user of the tool itself [22:12:11] what bd808 said [22:16:21] 06Labs, 10Labs-Kubernetes, 10Tool-Labs, 06Community-Tech-Tool-Labs: Develop evaluation criteria for comparing Platform as a Service (PaaS) solutions - https://phabricator.wikimedia.org/T136265#2766590 (10tom29739) Being able to use custom Docker images would be extremely helpful, because it removes the nee... 
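The "Permission denied" on error.log in the exchange above comes down to ordinary Unix permission bits: the file was listed as `-rw-r--r-- 1 samtar tools.nppdash`, while the container runs as the tool user, which only matches the group bits. Here is a minimal sketch of that check. The numeric ids are hypothetical (the log shows only names), root and ACLs are ignored, and it assumes `take` makes the tool user the file's owner.

```python
import stat


def can_write(mode, file_uid, file_gid, proc_uid, proc_gids):
    """Classic Unix write check: owner bits if the uid matches,
    else group bits if the process is in the file's group,
    else the 'other' bits. Root and ACLs are deliberately ignored."""
    if proc_uid == file_uid:
        return bool(mode & stat.S_IWUSR)
    if file_gid in proc_gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Hypothetical numeric ids; the chat only gives the names.
SAMTAR_UID = 1000   # the maintainer's personal account
TOOL_UID = 5000     # tools.nppdash user (what the container runs as)
TOOL_GID = 5000     # tools.nppdash group

MODE = 0o644  # -rw-r--r-- as shown in the ls output

# Before: file owned by samtar, group tools.nppdash, process is the tool user.
before = can_write(MODE, SAMTAR_UID, TOOL_GID, TOOL_UID, {TOOL_GID})

# After: same mode, but the tool user now owns the file (assumed effect of `take`).
after = can_write(MODE, TOOL_UID, TOOL_GID, TOOL_UID, {TOOL_GID})

print(before, after)
```

With the mode unchanged, only the ownership differs between the two calls, which is why changing the owner and restarting the webservice was enough; making the file group-writable would be another way to get the same result.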
[22:23:09] RECOVERY - Host tools-secgroup-test-103 is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms [22:25:38] PROBLEM - Host tools-secgroup-test-103 is DOWN: CRITICAL - Host Unreachable (10.68.21.22) [22:40:15] 10Tool-Labs-tools-Other: create tool to crunch metrics for views (play started) of video and audio files - https://phabricator.wikimedia.org/T116363#2766645 (10harej-NIOSH) All the past data has now been ingested, 1 January 2015 to 1 November 2016. There will be daily ingests that take place at around 20:00 UTC... [23:01:40] RECOVERY - Host secgroup-lag-102 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [23:01:59] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2766719 (10greg) >>! In T149529#2760183, @Paladox wrote: > Ive managed to create a instance on the git project. It is a small instance. So, resolved? [23:02:47] 06Labs, 10Gerrit, 10grrrit-wm: Create grrrit-wm test instance - https://phabricator.wikimedia.org/T149529#2766720 (10Paladox) 05Open>03Resolved Yep. [23:04:40] PROBLEM - Host secgroup-lag-102 is DOWN: CRITICAL - Host Unreachable (10.68.17.218) [23:14:33] 10Labs-project-other: Successful pilot of Discourse on https://discourse.wmflabs.org/ as an alternative to wikimedia-l mailinglist - https://phabricator.wikimedia.org/T124690#2766818 (10Samwilson) It's up again now (but only recently I think?).