[01:16:43] Hi, I know tools are supposed to be read only but should this affect ssh login? [01:16:52] It seems I cannot login any more with ssh [01:17:30] debug1: Connecting to tools-login.wmflabs.org [208.80.155.163] port 22. [01:17:36] ... [01:17:41] debug1: Authentications that can continue: publickey,keyboard-interactive,hostbased [01:17:41] debug1: Next authentication method: publickey [01:17:41] debug1: Offering RSA public key: /home/hr/.ssh/id_rsa [01:17:42] debug1: Server accepts key: pkalg ssh-rsa blen 279 [01:17:43] Connection closed by 208.80.155.163 [01:17:59] hroest: Same problem here. Word is that they're working on it. [01:18:01] anybody else have similar issuse? [01:18:03] ah ok, [01:18:10] at least its not just me :-) [01:18:26] well I wish them good luck [01:18:31] $ ssh matthewrbowker@login.tools.wmflabs.org [01:18:31] Connection to login.tools.wmflabs.org closed by remote host. [01:18:31] Connection to login.tools.wmflabs.org closed. [01:56:43] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/ChongDae was created, changed by ChongDae link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/ChongDae edit summary: Created page with "{{Tools Access Request |Justification=Running Bot for kowiki |Completed=false |User Name=ChongDae }}" [02:14:28] For what it's worth, we're all four of us frantically jamming the pieces of toollabs back together :( [02:31:03] !log tools reboot tools-exec-1406 [02:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [02:56:02] !log tools reboot tools-exec-1405 to ensure noauto works (because atboot=>false is lies) [02:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [03:21:56] !log tools reboot tools-checker-01 [03:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [04:31:59] !log tools.nagf Restarted webservice to get access to r/w NFS [04:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.nagf/SAL [04:39:58] bd808: hmm, was that needed? [04:39:59] interesting [04:40:08] maybe I should do a restart of all webservices once we come back [04:40:24] yuvipanda: yeah. it had an error message about not being able to write to a cache dir [04:41:30] hey have we gotten enough jobs off of precise nodes to start shutting some of them down? [04:46:42] bd808: yeah [04:47:15] bd808: not today, but I think we can cut out 50% of them maybe in a few days [04:48:45] HaeB: paws is back now [04:49:15] saw it, thanks :) [04:59:26] yuvipanda: are you gonna do an webservice restart [04:59:39] all webservice* [05:03:02] madhuvishy: considering, but also hungry. [05:03:23] yuvipanda: i was wondering if i should send email or wait [05:03:27] till you do that [05:03:41] madhuvishy: ok, I'll do it now [05:03:52] we also have to switch back to the no /dev/null writing thing no? [05:04:04] yuvipanda: ^ [05:04:22] madhuvishy: nah, that I want to do tomorrow only [05:04:28] yuvipanda: okay cool [05:04:30] madhuvishy: in case stuff goes south. don't want 50g of logs [05:04:33] yes [05:04:49] thanks for the hard work guys! everything is running smoothly now [05:05:13] yay thanks musikanimal :) [05:05:23] I did notice my app is taking an unusually long time to update after I pulled in new code. Not sure if that's related to the maintenance [05:05:53] !log tools restarting all webservices on gridengine [05:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [05:05:58] musikanimal: what do you mean by 'update'? [05:06:20] madhuvishy: ok, I'll wait for this to finish and then start k8s. [05:06:25] I git pulled in some new HTML (and some JS), it's there in public_html directory but not in my browser [05:06:28] I've cleared the cache, etc. [05:07:10] let me try a different app, see if it's isolated to just this one [05:07:15] musikanimal: let us know if it's repeatable? [05:07:17] yeah okay [05:11:34] madhuvishy: yeah it seems none of them are updating. They're running on k8s [05:11:45] you should see a big "BLAH BLAH" instead of "Langviews Analysis" http://tools.wmflabs.org/langviews-test/ [05:12:03] there I edited the index.php directly [05:15:44] musikanimal: ah hmmm [05:16:53] musikanimal: can you try restarting the webservice? [05:17:23] tools.wmflabs.org is down atm [05:17:38] chasemp: probably going through the webservice restart cycle? [05:17:41] chasemp: probably momentary [05:17:45] yeah - it's up now [05:17:46] chasemp: and seems up to e too [05:17:55] Ah ok [05:20:40] !log tools restart all k8s webservices too [05:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [05:21:56] musikanimal: restart seems to have brought it back to liveness [05:22:00] musikanimal: try again? [05:22:07] I see BLAH BLAH! [05:22:09] :) [05:22:12] okay [05:22:25] so sorry for my ignorance, with k8s I can still do `webservice restart`? [05:23:52] musikanimal: yes [05:24:31] okay cool. Thanks! [05:26:19] i brought shinken back [05:27:47] yuvipanda: let me know when all restarts are done? [05:28:16] madhuvishy: yup will do [05:28:26] thanks! [05:29:11] madhuvishy: in the email you send out, mention that logs are still not being written until tomorrow? [05:29:17] yuvipanda: yes [05:29:25] madhuvishy: thanks [05:44:35] yuvipanda: Hm.. labs grafana/graphite is still a bit of a roulette. Getting the project dashboard to show 3 graphs is hard. Usually 2/3 or 3/3 are broken. [05:44:44] * Krinkle keeps refreshing until all three render.. [05:44:49] https://grafana-labs.wikimedia.org/dashboard/db/labs-project-board?from=now-3h&to=now&var-project=cvn&var-server=cvn-app4&var-server=cvn-app6 [06:25:36] ssh back, but php can't work? [06:30:59] Shizhao: what issue are you facing? [06:35:55] labs reboot, php not work, see https://tools.wmflabs.org/pub/ [06:36:23] I'm not change code [06:36:26] that page works for me? [06:37:06] just html work [06:38:47] 3 hours ago page is ok, now no [06:39:44] error.log no errors were logged [06:39:53] yeah, logs won't work for another 12h [06:40:11] :( [06:41:00] Is maintenance caused? [06:42:18] it shouldn't have affected it in any way, but it'll be hard to debug without access to error log [06:43:20] Shizhao: so I'd reccomend waiting for the logs to come back in ~12h before attempting to debug [06:43:33] Shizhao: I'm also going to just try restarting it, to see if that helps [06:45:10] Shizhao: try now? [06:45:22] !log tools.pub restarted with webservice stop && webservice --backend=kubernetes start [06:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.pub/SAL [06:51:03] thx [06:54:09] Shizhao: does it work for you? [07:05:57] yuvipanda: no :( [07:07:21] Shizhao: alright. let's check back tomorrow then [07:10:20] en [07:14:56] It seems that all php internal functions are not working [07:15:32] curl is ok [07:34:34] PROBLEM - Free space - all mounts on tools-docker-builder-03 is CRITICAL: CRITICAL: tools.tools-docker-builder-03.diskspace.root.byte_percentfree (<10.00%) [15:47:18] Is there a way I can get the stdout and stderr of 'generic' webservice jobs? I saw https://gerrit.wikimedia.org/r/319798 got merged yesterday which looks like it might be relevant but perhaps output wasn't being logged to ~/error.log before this either [15:59:53] tarrow: We are not logging those, I sent a note last night. We'll switch it back within the next couple hours [16:04:10] Ah cool, so once you switch it back I should see output in ~/error.log again? [16:12:27] tarrow: yes [16:18:25] hi yuvipanda and madhuvishy: I'm getting a "502 Bad Gateway" error when attempting to start a PAWS server. is PAWS still down? [16:18:39] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [16:21:26] zareen: shouldn't be - i'm looking [16:21:54] RECOVERY - Puppet staleness on tools-puppetmaster-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:22:01] zareen: it looks fine to me now [16:22:20] could you check again? [16:22:22] RECOVERY - Puppet staleness on tools-docker-builder-03 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:23:04] RECOVERY - Puppet staleness on tools-elastic-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:23:06] RECOVERY - Puppet staleness on tools-flannel-etcd-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:23:14] RECOVERY - Puppet staleness on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:24:36] RECOVERY - Puppet staleness on tools-elastic-03 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:25:05] madhuvishy: still not working for me [16:25:16] RECOVERY - Puppet staleness on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:26:08] RECOVERY - Puppet staleness on tools-k8s-etcd-03 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:26:41] zareen: can you try hard reloading? i am on phone, can look in a bit [16:26:51] RECOVERY - Puppet staleness on tools-elastic-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:27:17] RECOVERY - Puppet staleness on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:27:17] RECOVERY - Puppet staleness on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:27:45] RECOVERY - Puppet staleness on tools-proxy-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:27:51] RECOVERY - Puppet staleness on tools-flannel-etcd-03 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:28:35] RECOVERY - Puppet staleness on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:28:41] RECOVERY - Puppet staleness on tools-logs-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:28:45] RECOVERY - Puppet staleness on tools-prometheus-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:28:45] RECOVERY - Puppet staleness on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:29:37] RECOVERY - Puppet staleness on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:29:53] RECOVERY - Puppet staleness on tools-redis-1001 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:30:52] RECOVERY - Puppet staleness on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [3600.0] [16:31:03] madhuvishy: hard reload didn't fix it [16:31:11] zareen: ok looking now [16:33:38] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0] [17:07:45] bd808: sorry to bug you. If you have a chance could you take a look at T149709? [17:07:45] T149709: Possible use of tools-lab-elasticsearch cluster - https://phabricator.wikimedia.org/T149709 [17:11:39] tarrow: man. I just keep forgetting about you. :( I'll run the "fancy" new process today and get you your creds. [17:12:30] zareen: sorry, can you check now? [17:13:17] madhuvishy: all good now, thanks for the help [17:13:26] zareen: cool :) np! [17:24:01] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:26:57] madhuvishy: what did you do to fix? [17:28:09] PROBLEM - Puppet run on tools-exec-1419 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:28:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1207 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:28:17] PROBLEM - Puppet run on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:29:23] chasemp: paws? I logged in as the pod tool, did kubectl get pods - it didn't have a pod for zareen - but i just deleted the hub and proxy pods and they respawned [17:29:33] paws tool* [17:29:40] PROBLEM - Puppet run on tools-exec-1408 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [17:29:54] i realized i didn't know anything about this setup on k8s [17:31:54] PROBLEM - Puppet run on tools-exec-1213 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [17:32:30] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:32:54] PROBLEM - Puppet run on tools-exec-1211 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:32:58] PROBLEM - Puppet run on tools-exec-1403 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:33:02] PROBLEM - Puppet run on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:33:20] PROBLEM - Puppet run on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:33:32] PROBLEM - Puppet run on tools-exec-1204 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:33:48] PROBLEM - Puppet run on tools-exec-1402 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:35:00] PROBLEM - Puppet run on tools-exec-1202 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [17:35:07] PROBLEM - Puppet run on tools-exec-1221 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:35:13] PROBLEM - Puppet run on tools-exec-1413 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:35:49] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:36:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1405 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:36:19] PROBLEM - Puppet run on tools-exec-1206 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:37:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1206 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:38:15] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:38:39] PROBLEM - Puppet run on tools-webgrid-lighttpd-1418 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:38:43] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [17:39:09] PROBLEM - Puppet run on tools-exec-1407 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:39:33] PROBLEM - Puppet run on tools-exec-1414 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:39:39] PROBLEM - Puppet run on tools-exec-1416 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:40:31] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [17:42:06] PROBLEM - Puppet run on tools-exec-1210 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [0.0] [17:42:26] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:43:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:43:52] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:44:16] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:44:19] PROBLEM - Puppet run on tools-exec-1201 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:44:20] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:45:24] PROBLEM - Puppet run on tools-exec-1208 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:46:20] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:46:50] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:47:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:47:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:47:29] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:47:41] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:47:43] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [17:47:53] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:48:18] andrewbogott: is this ldap issues^? [17:48:21] PROBLEM - Puppet run on tools-exec-1412 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:48:57] PROBLEM - Puppet run on tools-exec-1418 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:49:01] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:49:07] PROBLEM - Puppet run on tools-webgrid-lighttpd-1202 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:49:15] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:49:23] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:49:51] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:49:53] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [17:50:11] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:50:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [17:50:20] I no longer think it's ldap, but yeah, it's all those puppet failures that I was trying to investigate [17:50:38] PROBLEM - Puppet run on tools-webgrid-lighttpd-1415 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [17:51:12] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:51:56] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:53:16] PROBLEM - Puppet run on tools-exec-1420 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:54:16] PROBLEM - Puppet run on tools-exec-1207 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:54:24] PROBLEM - Puppet run on tools-webgrid-lighttpd-1411 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:54:40] PROBLEM - Puppet run on tools-exec-1209 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:55:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1201 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:55:13] PROBLEM - Puppet run on tools-exec-1205 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [17:55:15] PROBLEM - Puppet run on tools-exec-1203 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:56:02] PROBLEM - Puppet run on tools-exec-1415 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [17:57:12] PROBLEM - Puppet run on tools-webgrid-lighttpd-1203 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:57:32] PROBLEM - Puppet run on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [17:57:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [17:57:38] PROBLEM - Puppet run on tools-exec-1215 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:57:50] PROBLEM - Puppet run on tools-webgrid-lighttpd-1404 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:58:08] PROBLEM - Puppet run on tools-webgrid-lighttpd-1402 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:58:08] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:58:24] PROBLEM - Puppet run on tools-exec-1219 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [17:59:02] PROBLEM - Puppet run on tools-exec-1218 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [17:59:12] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [17:59:36] PROBLEM - Puppet run on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [18:02:45] bd808: thanks! I really appreciate it :-) [18:03:55] tarrow: let me know if you run into problems. I've been the only user of this elasticsearch cluster so far so you are a bit of a test case :) [18:04:01] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [18:08:55] are atm nfs problems? [18:09:24] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [18:12:23] Steinsplitter: can you be more specific? [18:13:01] Steinsplitter: i dont believe so probably just servers that are restarting or puppets not being used by puppetmasters atm [18:13:12] RECOVERY - Puppet run on tools-exec-1420 is OK: OK: Less than 1.00% above the threshold [0.0] [18:14:02] chasemp: i get errors with strange phats such as /mnt/nfs/labstore-secondary-tools-project/ and i can't git pull etc. just wondering if nfs problem atm otherwise i need to find out what is broken now. [18:14:21] I'm not sure what this means 'strange phats' [18:14:44] but that path works for me w/ my own tool on tools-bastion-03 [18:14:56] that is the mount that provides /data/project fyi [18:16:22] How do i move between bastions? [18:16:42] logout of one and into another [18:17:31] I just get 02 [18:17:49] zppix|mobile: try direct: tools-bastion-03.eqiad.wmflabs [18:18:06] Ah ok i was just curious [18:18:09] RECOVERY - Puppet run on tools-exec-1419 is OK: OK: Less than 1.00% above the threshold [0.0] [18:18:19] RECOVERY - Puppet run on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [18:18:43] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [18:18:49] RECOVERY - Puppet run on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [18:19:09] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [18:19:11] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [18:19:31] RECOVERY - Puppet run on tools-exec-1414 is OK: OK: Less than 1.00% above the threshold [0.0] [18:19:37] RECOVERY - Puppet run on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [18:19:39] RECOVERY - Puppet run on tools-exec-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [18:19:53] Hey bd808 stashbot needs rejoined to my chan again [18:20:12] RECOVERY - Puppet run on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0] [18:20:31] zppix|mobile: restarting now.... [18:20:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [18:20:38] RECOVERY - Puppet run on tools-webgrid-lighttpd-1415 is OK: OK: Less than 1.00% above the threshold [0.0] [18:21:02] RECOVERY - Puppet run on tools-exec-1415 is OK: OK: Less than 1.00% above the threshold [0.0] [18:21:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [18:21:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [18:21:20] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [18:22:30] RECOVERY - Puppet run on tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [18:22:46] RECOVERY - Puppet run on tools-webgrid-lighttpd-1404 is OK: OK: Less than 1.00% above the threshold [0.0] [18:23:00] RECOVERY - Puppet run on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [18:23:08] RECOVERY - Puppet run on tools-webgrid-lighttpd-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [18:23:40] RECOVERY - Puppet run on tools-webgrid-lighttpd-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [18:23:47] chasemp: \o/ wors again. likely just a machine was bus. thanks. [18:23:55] *k [18:24:08] PROBLEM - Puppet run on tools-exec-1419 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:24:12] PROBLEM - Puppet run on tools-exec-1420 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:24:18] PROBLEM - Puppet run on tools-exec-1412 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:24:26] PROBLEM - Puppet run on tools-webgrid-lighttpd-1401 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [18:24:41] PROBLEM - Puppet run on tools-exec-1417 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [18:24:43] PROBLEM - Puppet run on tools-exec-1411 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [18:25:09] PROBLEM - Puppet run on tools-exec-1407 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:25:13] PROBLEM - Puppet run on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:25:19] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [18:25:35] PROBLEM - Puppet run on tools-exec-1414 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [18:26:13] PROBLEM - Puppet run on tools-exec-1413 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [18:26:33] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:27:01] PROBLEM - Puppet run on tools-exec-1415 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [18:27:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [18:27:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1405 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [18:32:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [18:33:04] RECOVERY - Puppet run on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [18:33:06] RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0] [18:33:20] RECOVERY - Puppet run on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0] [18:33:58] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [18:34:24] RECOVERY - Puppet run on tools-webgrid-lighttpd-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [18:34:34] RECOVERY - Puppet run on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [18:35:12] RECOVERY - Puppet run on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [18:36:32] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [18:36:48] looks like the cron box is broken [18:36:57] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [18:37:01] RECOVERY - Puppet run on tools-exec-1415 is OK: OK: Less than 1.00% above the threshold [0.0] [18:37:09] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [18:37:09] error: SGE_ROOT directory "/var/lib/gridengine" doesn't exist [18:37:28] Betacommand: ok thanks, somewhat of a known freak thing we are cleaning up [18:37:29] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [18:37:30] I'll hit that next [18:37:57] Is kubectl affected? [18:38:04] shouldn't be at all [18:38:07] chasemp: np. thought I would let you know since I havent seen any outage emails [18:38:11] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [18:38:17] RECOVERY - Puppet run on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0] [18:38:38] it's entirely internal grid engine house keeping shenanigans [18:39:09] RECOVERY - Puppet run on tools-exec-1419 is OK: OK: Less than 1.00% above the threshold [0.0] [18:39:23] RECOVERY - Puppet run on tools-webgrid-lighttpd-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [18:39:51] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [18:41:09] RECOVERY - Puppet run on tools-exec-1413 is OK: OK: Less than 1.00% above the threshold [0.0] [18:43:18] Betacommand: cron should be ok [18:44:16] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0] [18:44:42] RECOVERY - Puppet run on tools-exec-1411 is OK: OK: Less than 1.00% above the threshold [0.0] [18:45:08] RECOVERY - Puppet run on tools-exec-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [18:45:34] RECOVERY - Puppet run on tools-exec-1414 is OK: OK: Less than 1.00% above the threshold [0.0] [18:49:18] RECOVERY - Puppet run on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [18:49:41] RECOVERY - Puppet run on tools-exec-1417 is OK: OK: Less than 1.00% above the threshold [0.0] [18:54:19] RECOVERY - Puppet run on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [18:55:13] RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0] [18:57:16] !log tools.wikibugs restarted wikibugs [18:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL [18:57:59] 06Labs, 10DBA: Recover/Rename p50380g51020_perfectbot on replica LabsDBs or toolDBs - https://phabricator.wikimedia.org/T150659#2796674 (10jcrespo) 05Open>03declined I apologize again. [18:58:56] RECOVERY - Puppet run on tools-exec-1418 is OK: OK: Less than 1.00% above the threshold [0.0] [18:59:12] RECOVERY - Puppet run on tools-exec-1420 is OK: OK: Less than 1.00% above the threshold [0.0] [19:05:01] RECOVERY - Puppet run on tools-exec-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [19:05:15] RECOVERY - Puppet run on tools-exec-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [19:06:03] 06Labs, 10Labs-Infrastructure, 07LDAP: Remove shell user "80686" - https://phabricator.wikimedia.org/T63967#2796723 (10demon) >>! In T63967#2600758, @MoritzMuehlenhoff wrote: > validnames is a configuration setting of nslcd and configured via a regex in puppet. There's a comment that the regex must be kept i... [19:10:55] 06Labs, 10Gerrit, 10wikitech.wikimedia.org, 07LDAP: Alter full name on Gerrit - https://phabricator.wikimedia.org/T149976#2796754 (10demon) 05Open>03Resolved Sorry for the delay, all done. Please note, your shell name remains `gryllida` as those are not changed. Your display name and what you login to... [19:23:29] RECOVERY - Puppet run on tools-exec-1204 is OK: OK: Less than 1.00% above the threshold [0.0] [19:25:13] RECOVERY - Puppet run on tools-exec-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [19:26:18] RECOVERY - Puppet run on tools-exec-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [19:29:16] RECOVERY - Puppet run on tools-exec-1207 is OK: OK: Less than 1.00% above the threshold [0.0] [19:30:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [19:35:23] RECOVERY - Puppet run on tools-exec-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [19:37:05] RECOVERY - Puppet run on tools-exec-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [19:39:41] RECOVERY - Puppet run on tools-exec-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [19:41:54] RECOVERY - Puppet run on tools-exec-1213 is OK: OK: Less than 1.00% above the threshold [0.0] [19:42:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [19:42:32] RECOVERY - Puppet run on tools-webgrid-lighttpd-1208 is OK: OK: Less than 1.00% above the threshold [0.0] [19:42:36] RECOVERY - Puppet run on tools-webgrid-lighttpd-1204 is OK: OK: Less than 1.00% above the threshold [0.0] [19:42:54] RECOVERY - Puppet run on tools-exec-1211 is OK: OK: Less than 1.00% above the threshold [0.0] [19:43:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1207 is OK: OK: Less than 1.00% above the threshold [0.0] [19:44:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [19:44:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [19:47:11] RECOVERY - Puppet run on tools-webgrid-lighttpd-1203 is OK: OK: Less than 1.00% above the threshold [0.0] [19:47:13] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [19:47:39] RECOVERY - Puppet run on tools-exec-1215 is OK: OK: Less than 1.00% above the threshold [0.0] [19:47:41] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [19:48:32] is it normal that i have jsub error on dev.tools but not on login.tools ? [19:48:32] error: SGE_ROOT directory "/var/lib/gridengine" doesn't exist [19:48:51] Framawiki: no it's a known issue we are dealing w/ [19:48:55] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0] [19:49:05] Framawiki: what host is that? [19:51:36] tools.framabot@tools-bastion-02 [19:51:46] dev.login [19:51:49] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0] [19:51:52] !log reboot tools-precise-dev [19:52:11] Unknown project "reboot" [19:52:20] !log tools reboot tools-precise-dev [19:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [19:52:29] Framawiki: sure I'll handle it, will require a quick reboot tho [19:52:38] thanks you [19:54:02] RECOVERY - Puppet run on tools-exec-1218 is OK: OK: Less than 1.00% above the threshold [0.0] [19:58:22] RECOVERY - Puppet run on tools-exec-1219 is OK: OK: Less than 1.00% above the threshold [0.0] [19:59:52] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0] [20:00:06] RECOVERY - Puppet run on tools-exec-1221 is OK: OK: Less than 1.00% above the threshold [0.0] [20:12:26] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:13:14] RECOVERY - Puppet run on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [0.0] [20:15:02] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:15:38] PROBLEM - Host tools-puppetmaster-01 is DOWN: CRITICAL - Host Unreachable (10.68.22.61) [20:17:13] RECOVERY - Puppet run on tools-webgrid-lighttpd-1206 is OK: OK: Less than 1.00% above the threshold [0.0] [20:17:53] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [20:19:19] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [20:20:42] run now, thanks ! [20:22:42] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:40:46] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:02] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0] [21:01:09] 06Labs, 10Labs-Infrastructure, 10DBA, 07Epic, 07Tracking: Labs databases rearchitecture (tracking) - https://phabricator.wikimedia.org/T140788#2797093 (10jcrespo) [21:05:51] yuvipanda hi im wondering if we could switch wikibugs over to kubunetes as long as it is ok with legoktm and valhallasw`vecto please? [21:06:02] It would make restarting the bot easyer. [21:06:19] I'm not a wikibugs maintainer, so you've to ask them and not me :) [21:06:47] Oh ok [21:06:49] sorry [21:09:07] 06Labs, 10Labs-Infrastructure, 10DBA: Provision db1095 with at least 1 shard, sanitize and test slave-side triggers - https://phabricator.wikimedia.org/T150802#2797122 (10jcrespo) [21:10:53] Paladox: uh, why would that make restarting easier? [21:11:42] According to https://www.mediawiki.org/wiki/Wikibugs#Deploying_changes you have to install fab on your local pc [21:11:45] But no, I'd rather not. The current infrastructure is built around SGE [21:11:48] Ok [21:11:50] Yes [21:11:52] So? [21:12:14]