[00:01:31] I see… “emmanuel@emmanuel-Inspiron-1525” and “rsa-key-20140902” as the two registered keys. Is one of those the one you’re using?
[00:01:58] If you are having access problems, please see: https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances
[00:01:58] Permission denied (publickey,hostbased).
[00:01:59] If you are having access problems, please see: https://wikitech.wikimedia.org/wiki/Access#Accessing_public_and_private_instances
[00:01:59] Permission denied (publickey,hostbased).
[00:01:59] yes
[00:02:03] sorry
[00:02:05] yes
[00:04:29] Baccanal: do you see those “no such identity: /home/emmanuel/.ssh/id_dsa: No such file or directory” messages in your -vvv output?
[00:04:38] Seems like you can’t read your own keys, maybe.
[00:04:43] matanya: what do you think?
[00:05:34] Wikibugs, Tool-Labs, Phabricator: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1042643 (Krinkle) stalled>Invalid a:Krinkle Bazinga!
[00:05:35] Wikibugs: wikibugs should notify on dependency changes - https://phabricator.wikimedia.org/T77006#1042646 (Krinkle)
[00:06:48] andrewbogott: yes but I have a file id_rsa instead
[00:07:28] Can you run with -vvv again and paste output?
[00:08:27] https://dpaste.de/2P6J
[00:14:47] Baccanal: what is your username on wikitech?
[00:15:15] andrewbogott: Automatik
[00:15:31] ok.
[00:15:50] I don’t know what’s happening… everything in the logs is consistent with just having a mismatched key :(
[00:26:10] ok, so it's weird
[00:29:31] Baccanal: someone else might have new ideas to help sort this
[00:41:36] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<25.00%)
[00:46:55] I have to go, thanks andrewbogott for your help
[00:51:39] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[00:54:29] Labs: Missing record in replica - https://phabricator.wikimedia.org/T89689#1042701 (Phe) NEW
[00:57:59] Labs: Missing record in replica - https://phabricator.wikimedia.org/T89689#1042709 (scfc) a:Springle @Springle, could you look into this as well, please?
[01:47:11] Labs, Labs-Vagrant, Wikimedia-Labs-Other, Wikimedia-Labs-wikistats, Wikimedia-Labs-wikitech-interface, Wikimedia-Labs-General, Beta-Cluster, Project-Creators, Wikimedia-Labs-Infrastructure, Tool-Labs-tools-Article-request, Tool-Labs, Wikimedia-Labs-extdist, MediaW... - https://phabricator.wikimedia.org/T89270#1042736
[04:22:21] Tool-Labs: Tools crontab replacement must check whether run as root - https://phabricator.wikimedia.org/T87527#1042850 (scfc)
[04:22:22] Tool-Labs: tools-trusty uses local crontab instead of remote (tools-cron?) crontab - https://phabricator.wikimedia.org/T86445#1042851 (scfc)
[04:24:57] Tool-Labs: tools-trusty uses local crontab instead of remote (tools-cron?) crontab - https://phabricator.wikimedia.org/T86445#1042853 (scfc)
[04:24:58] Tool-Labs: Tools crontab replacement must check whether run as root - https://phabricator.wikimedia.org/T87527#1042852 (scfc)
[04:28:54] Tool-Labs: tools-trusty uses local crontab instead of remote (tools-cron?) crontab - https://phabricator.wikimedia.org/T86445#1042855 (scfc)
[04:28:56] Tool-Labs: Tools crontab replacement must check whether run as root - https://phabricator.wikimedia.org/T87527#1042854 (scfc)
[04:29:08] Tool-Labs: Tools crontab replacement must check whether run as root - https://phabricator.wikimedia.org/T87527#993597 (scfc)
[04:29:42] Tool-Labs: tools-trusty uses local crontab instead of remote (tools-cron?) crontab - https://phabricator.wikimedia.org/T86445#968396 (scfc) (This "blocking" vs. "blocked by" business is absolutely frustrating :-(.)
[06:03:41] Tool-Labs-tools-Other: tmg/articlemedia tool not working - https://phabricator.wikimedia.org/T89695#1042897 (Phoebe) NEW
[06:37:44] legoktm: I think no one has issues with it as long as it's readable
[06:38:36] But going back to all green is also fine with me
[06:43:28] okay :/
[06:48:11] PROBLEM - Puppet failure on tools-exec-11 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[06:54:49] legoktm: the colors look cool and help to distinguish tag types, but they are maybe too inconsistent between clients...
[06:55:23] (it looks great in irssi, less so in irccloud)
[07:18:12] RECOVERY - Puppet failure on tools-exec-11 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:53:41] Tool-Labs: Java jobs stop working - https://phabricator.wikimedia.org/T88799#1042987 (dnaber) p:Triage>High
[07:56:57] Tool-Labs: Java jobs stop working - https://phabricator.wikimedia.org/T88799#1042988 (dnaber) I've set this to priority 'high' now because the jobs need to run 24 hours per day to be really useful. Also, I have to log in about 5 times or so per week to restart the jobs. If we cannot solve this issue, I'll nee...
[08:27:01] Tool-Labs: Java jobs stop working - https://phabricator.wikimedia.org/T88799#1043022 (valhallasw) Please check the output of ```qacct -j fr-feedcheck``` to see what 'max vmem' is. Does it get up to 7500m? In addition, is the job still running or not? If it's still running, it's not a memory issue. Last,...
[12:04:51] Hi
[12:05:16] I have a problem: I'm not in the same groupe as my tool account. What should I do?
[12:05:21] group*
[12:13:31] Baccanal: I'm not sure what you mean by that. Do you mean 'groups' doesn't return the tool?
[12:13:40] in that case: have you tried logging out and in again?
[12:17:43] == {{langue|fr}} ==
[12:17:43] === {{S|étymologie}} ===
[12:17:43] : {{cf|administrateur|système}}.
[12:17:44] === {{S|nom|fr}} ===
[12:17:44] {{fr-rég|ad.mi.ni.stʁa.tœʁ sis.tɛm|p=administrateurs systèmes}}
[12:17:44] '''administrateur systèmes''' {{pron|ad.mi.ni.stʁa.tœʁ sis.tɛm|fr}} {{m}}
[12:17:44] # {{informatique|fr}} Personne responsable des [[serveur]]s d’une organisation.
[12:17:45] ==== {{S|traductions}} ====
[12:17:45] {{trad-début|}}
[12:17:45] * {{T|en}} : {{trad-|en|system administrator}}
[12:17:46] * {{T|nl}} : {{trad+|nl|systeembeheerder}}
[12:17:46] {{trad-fin}}
[12:17:47] [[Catégorie:Métiers du secteur tertiaire en français]]
[12:17:47] {{clé de tri|administrateur systemes}}
[12:18:20] sorry for this unfortunate copy and paste
[12:31:04] for more clarity:
[12:31:15] valhallasw`cloud: in fact, I logged in with SSH, and copied a file from /home/botomatik to /data/project/botomatik. But when I 'become botomatik', I can't write the file copied from /home/botomatik
[12:32:18] Baccanal: right, because the group is your user group, not the tool group.
[12:32:58] Baccanal: either take the file as tool ( 'take ') or chown user:group as user
[12:34:35] ok take works!
[14:43:49] Labs, Wikimedia-Labs-wikitech-interface: Wikitech may fail to add the project-bastion group when shell access is granted - https://phabricator.wikimedia.org/T89667#1043590 (coren) I've spent most of the last day trying to catch this bug "in the act" with little success beyond observing that jobs set to update...
[15:49:10] andrewbogott_afk: Ping me once you're around?
[16:15:14] Coren, andrewbogott_afk: https://phabricator.wikimedia.org/T88923
[16:15:34] it's a failed disk, no response for over a week
[16:17:15] paravoid: I never saw that one go by. I'll sync up with Chris asap to swap it.
[16:17:22] thx
[16:17:28] needs to be tagged with #ops-eqiad
[16:17:51] it's not yet, it should be investigated first to make sure it's a HW problem indeed
[16:22:42] Gerrit-Patch-Uploader: Gerrit Patch Uploader does not reliably clean up after itself in /tmp - https://phabricator.wikimedia.org/T88517#1043719 (Fomafix) Resolved>Open REOPEN The error occurs again: ``` Result from uploading patch: /data/project/gerrit-patch-uploader/git/bin/git clone --depth=1 ssh://g...
[16:23:35] paravoid: That's what I'm looking at now.
[16:23:47] awesome
[16:24:33] hostbyte=DID_NO_CONNECT is 99% sure to be a dead disk. Updating ticket now.
[16:28:48] yup
[16:30:11] Labs: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1043738 (coren) Disk taken out of array because: [943491.790588] scsi 6:0:0:0: [sda] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK [943491.790590] scsi 6:0:0:0: [sda] CDB: Write(10): 2a 00 00 00 00 47 00 00 01 00 [943491.790595] en...
[16:30:27] Labs, ops-eqiad, operations: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1043740 (coren)
[16:30:45] Labs, ops-eqiad, operations: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1043744 (coren) p:High>Unbreak!
[16:31:00] Gerrit-Patch-Uploader: Gerrit Patch Uploader does not reliably clean up after itself in /tmp - https://phabricator.wikimedia.org/T88517#1043745 (valhallasw) Odd. I cleared out /tmp again (which was again filled by g-p-u... or at least 3.5GB of it) and restarted the webservice; maybe the fixed code wasn't loaded?
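The /tmp trouble reported in T88517 is the familiar case of a scratch directory that outlives a failed run. Below is a minimal Python sketch of one way to guarantee cleanup with `tempfile.mkdtemp` and `shutil.rmtree` in a `try`/`finally`. The names `run_in_scratch_dir` and `fake_clone` are hypothetical and chosen only for illustration; this is not the Gerrit Patch Uploader's actual code.

```python
import os
import shutil
import tempfile


def run_in_scratch_dir(work):
    """Run work(path) inside a fresh temporary directory, then delete it."""
    scratch = tempfile.mkdtemp(prefix="patch-upload-")
    try:
        return work(scratch)
    finally:
        # Runs on success and on exceptions alike, so /tmp does not fill up.
        shutil.rmtree(scratch, ignore_errors=True)


def fake_clone(path):
    # Hypothetical stand-in for the real `git clone` step.
    with open(os.path.join(path, "README"), "w") as handle:
        handle.write("placeholder checkout\n")
    return path


used_path = run_in_scratch_dir(fake_clone)
print(os.path.exists(used_path))
```

Putting the removal in `finally` means a crashed upload still releases its disk space, which is the failure mode the ticket describes.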
[16:33:39] Gerrit-Patch-Uploader: Gerrit Patch Uploader does not reliably clean up after itself in /tmp - https://phabricator.wikimedia.org/T88517#1043757 (valhallasw) ``` File "/usr/lib/pymodules/python2.7/flup/server/fcgi_base.py", line 558, in run protocolStatus, appStatus = self.server.handler(self) File "/u...
[16:34:37] [gerrit-patch-uploader] valhallasw pushed 1 new commit to master: http://git.io/NANO
[16:34:37] gerrit-patch-uploader/master e90e5e3 Merlijn van Deen: import shutil for great glory
[16:35:17] Gerrit-Patch-Uploader: Gerrit Patch Uploader does not reliably clean up after itself in /tmp - https://phabricator.wikimedia.org/T88517#1043762 (valhallasw) Open>Resolved Deployed with import. Should *really* be OK now :-p
[16:35:43] Labs: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1043767 (coren) After investigation and testing with labstore2001, disabling idmap entirely will be //approximately// a noop but will require a reboot and restart of labstore. (I plan on using the restart required by the shelves being mo...
[16:39:06] RECOVERY - Free space - all mounts on tools-webgrid-02 is OK: OK: All targets OK
[16:54:44] problems ?
[16:55:42] PROBLEM - Host tools-webproxy is DOWN: PING CRITICAL - Packet loss = 100%
[16:56:26] PROBLEM - Host tools-exec-cyberbot is DOWN: CRITICAL - Host Unreachable (10.68.16.39)
[16:57:02] PROBLEM - Host tools-exec-03 is DOWN: CRITICAL - Host Unreachable (10.68.16.32)
[16:57:02] PROBLEM - Host tools-webgrid-04 is DOWN: CRITICAL - Host Unreachable (10.68.17.174)
[16:57:03] labs seems to be down. http://tools.wmflabs.org/ etc
[16:57:12] PROBLEM - Host tools-exec-09 is DOWN: CRITICAL - Host Unreachable (10.68.17.64)
[16:57:35] yep
[16:58:40] PROBLEM - Host tools-exec-07 is DOWN: CRITICAL - Host Unreachable (10.68.16.36)
[17:00:13] Coren, legoktm ^
[17:00:23] meh
[17:00:28] Yuvi|Vacation: proxy is down?
[17:00:35] he's on vacation
[17:00:40] labs are having vacation as well
[17:00:44] heh
[17:01:39] ok so who do we have here who isn't on vacation
[17:01:49] Coren: are you here? it seems that web proxy is not working
[17:01:54] wm-bot...
[17:02:00] !ping
[17:02:01] !pong
[17:02:13] Not on vacation. ^^ see.
[17:02:14] wm-bot is superbot, it survives every outage
[17:02:19] too well-written a program
[17:02:26] Not every...
[17:02:31] if all programs were as epic as wm-bot, the world would be a better place
[17:02:38] +1
[17:03:05] ok so icinga just died
[17:03:12] houston WE GOT A PROBLEM
[17:03:58] andrewbogott: or Coren, if you are about, labs seems to be having some kind of issue :)
[17:04:32] chasemp: I’m in a call… will try to catch up
[17:04:49] andrewbogott: no worries just hadn't seen anyone ping you yet
[17:04:58] andrewbogott: ping
[17:05:10] andrewbogott: did you notice that labs just blew up?
[17:05:31] Can y’all be more specific? I’m multitasking, haven’t had a chance to read the backscroll
[17:05:46] for a start, webproxy doesn't work
[17:05:49] it times out
[17:05:52] ok
[17:05:59] but other VMs are down as well
[17:06:02] per icinga
[17:06:06] a lot of boxes died
[17:06:31] you see all these people are joining ^ because they are confused why their boxes just died
[17:06:35] they are all too shy to speak
[17:06:44] they just silently watch you and wait for you to save us
[17:08:25] nah, we're just trying to read the channel logs before re-asking if the world has died
[17:11:02] andrew@virt1005:/$ free
[17:11:03] -bash: /usr/bin/free: Input/output error
[17:11:20] So one of the virt hosts is dying. I’m investigating :(
[17:11:33] Hi
[17:11:47] Hi Qcoder00, issue already reported, seems to be recovery time
[17:11:48] Why does labs having an issue slow down other sites?
[17:12:00] I stand corrected
[17:12:05] I thought labs was on a separate circuit?
[17:14:43] It is, but if you have gadgets that access labs then you sort of bridge the gap :)
[17:14:49] But only for your machine
[17:15:04] andrewbogott, chasemp I noticed that Special:NovaProxy shows no entries for deployment-prep
[17:15:05] but I was pretty sure we had a few just a couple of days ago
[17:15:06] maybe related, I don't know
[17:15:16] Qcoder00, separate circuit to what? the rest of production?
[17:15:21] Yeah
[17:15:25] it is
[17:15:30] Hmmm
[17:15:38] So Wikisource shouldn't be slowing down?
[17:15:42] no
[17:15:46] Strange
[17:18:48] Hmm seems a Gadget is bridging the gap
[17:18:58] Turned them all off and it's running much faster
[17:19:36] You had a gadget pulling data from labs on every page view? :p
[17:20:50] Yeah...
[17:20:53] Not sure which though
[17:21:23] Oh dear
[17:22:35] hi
[17:23:15] Qcoder00, enwikisource?
[17:23:20] Yeah
[17:23:51] Still not sure which Gadget would do that?
[17:23:59] one of these: MediaWiki:BookMaker.js, MediaWiki:Gadget-BugStatusUpdate.js, MediaWiki:Gadget-WSexport.js, MediaWiki:Gadget-WhatLeavesHere.js, MediaWiki:Gadget-ocr.js, MediaWiki:Mobile.js, MediaWiki:OCR.js, MediaWiki:TemplateScript/pagetools-config.js
[17:24:19] Here is a list of hosts affected by the current outage: https://phabricator.wikimedia.org/P305
[17:24:30] probably one prefixed with Gadget-, though a gadget could include any of them
[17:26:40] MediaWiki:Gadget-WSexport.js, (snickers...)
[17:26:58] (Sorry)
[17:27:10] PROBLEM - Host tools-submit is DOWN: CRITICAL - Host Unreachable (10.68.17.1)
[17:27:34] PROBLEM - Host tools-webgrid-tomcat is DOWN: CRITICAL - Host Unreachable (10.68.16.29)
[17:30:20] Wikimedia-Labs-Infrastructure: source group field is confusing - https://phabricator.wikimedia.org/T69759#1044008 (chasemp) Open>declined a:chasemp meh, I made this report ages ago in bugzilla because I found this confusing. I'm closing it myself as I see no real actionable.
[17:37:58] good news, virt1005 survived a reboot so we may be able to rescue these VMs.
[17:38:11] It’ll be a bit before they’re up and running though
[17:40:12] Gah! Fine time I took to get lunch.
[17:41:24] More precisely, fine time virt1005 took to die.
[17:41:35] andrewbogott: Point me at work I can do to speed things up.
[17:42:08] Coren: ok. I’m going to cold-migrate everything to virt1012.
[17:43:11] andrewbogott: Giving up on virt1005? Do we know what died?
[17:43:37] disk controller
[17:44:29] dmesg
[17:46:50] Coren: https://phabricator.wikimedia.org/P304
[17:46:52] andrewbogott: afaict, the box is fully up right now; or is a controller entirely missing?
[17:47:03] Also, review please? https://gerrit.wikimedia.org/r/#/c/191079/
[17:47:15] I cycled power and it came back up but now I’m very suspicious.
[17:47:25] (backscroll about all this in #security)
[17:48:45] I just read the backscroll and I agree with the evacuation - that's a nasty thing even if it was a fluke and I'd make sure it was before trusting it again.
[17:49:38] yeah
[17:49:50] Next task after this fire is out is a req for some new HPs
[18:01:52] Found the first gadget that's problematic with labs...
[18:02:01] MediaWiki:OCR.js
[18:10:23] question: how hard is it to have the web proxy on multiple systems, like WDQ?
[18:13:48] GerardM-: With the current setup, fairly hard - it'd need a HA redesign. That said, the current outage does suggest that it might be worth the trouble.
[18:31:43] GerardM-: The webproxy should be back around 15-20 min from now.
[18:37:36] GerardM-: It should be up now, actually.
[18:38:30] RECOVERY - Host tools-webproxy is UP: PING OK - Packet loss = 0%, RTA = 0.51 ms
[18:42:52] PROBLEM - Puppet failure on tools-exec-catscan is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:46:33] Tool-Labs, Wikimedia-Labs-Infrastructure: Make ar_content_format and ar_content_model available on ToolLabs - https://phabricator.wikimedia.org/T89741#1044220 (Umherirrender) NEW
[18:46:40] RECOVERY - Host tools-submit is UP: PING OK - Packet loss = 0%, RTA = 0.70 ms
[18:56:05] Hmm OCR is still down, but that's not a big inconvenience
[19:03:27] Qcoder00: Does it live in tools or on its own project?
[19:03:35] Tools I think
[19:03:59] MediaWiki:OCR.js is the script, but it uses a labs resource I think
[19:07:52] RECOVERY - Puppet failure on tools-exec-catscan is OK: OK: Less than 1.00% above the threshold [0.0]
[19:15:51] webproxy seems not up? https://phab-01.wmflabs.org/
[19:16:00] mentioning because it seemed like it was meant to be back
[19:16:59] chasemp: there are two different proxies, one for tools and one for everything else.
[19:17:05] the tools proxy is back, the other one, not yet.
[19:17:10] ah, shows my lack of knowledge :)
[19:29:41] Hoi, WDQ is still down ... it could be up but is not
[19:30:43] given that it is load balanced ....
[19:43:00] Labs: New hp servers for labs - https://phabricator.wikimedia.org/T89752#1044517 (Andrew) NEW
[19:46:37] Labs: New hp servers for labs - https://phabricator.wikimedia.org/T89752#1044537 (Andrew) Specs and pricing here: https://rt.wikimedia.org/Ticket/Display.html?id=8390
[20:12:08] chasemp: proxy is back, finally
[20:12:57] tx
[20:13:49] do we have an ETA for beta cluster being functional again?
[20:21:26] Hi!
[20:22:07] I'm getting an error when using jstart. Am I in the right place here?
[20:24:24] Coren, petan: ?
[20:24:52] seth: We're at reduced capacity atm because of a hardware failure - that is almost certainly the issue.
[20:25:08] the message is "libgcc_s.so.1 must be installed for pthread_cancel to work" and "/var/spool/gridengine/execd/tools-exec-05/job_scripts/8221399: line 4: 17637 Aborted (core dumped) [...]"
[20:25:24] Ah, no - that's just an out-of-memory. :-)
[20:25:56] if I don't use jstart, everything works fine.
[20:26:00] https://wikitech.wikimedia.org/wiki/Help:Tool_Labs#Why_am_I_getting_errors_about_libgcc_s.so.1_must_be_installed_for_pthread_cancel_to_work.3F
[20:28:18] Oops, sorry for not having read the faq sufficiently. Thanks for the link, I'll try that. :-)
[20:47:55] Coren: is web proxy back up?
[20:48:15] Betacommand: It is, though not all webgrid nodes are back yet (they are being evacuated now)
[20:49:17] I can't restart or stop my webservice. screenshot: http://img42.com/DEp5h+
[20:49:44] what should I do ?
[20:50:15] ArashPT: We're at reduced capacity atm due to hardware failure. You can ^C this - your webservice will remain queued and will be able to restart once the nodes return to availability
[20:50:28] thanks
[20:51:20] petan: "[Wikitech-l] New recent changes library", broken urls (and no 404 handlers on wikitech apparently)
[20:52:49] Coren: A new user cannot (and never could) log in to gerrit. It seems his LDAP entry does not have the mail property set.
[20:52:50] Can I verify somewhere/somehow that the user has set an email address on wikitech? Or how would the user get his LDAP "mail" set?
[20:54:10] qchris: Almost certainly related to https://phabricator.wikimedia.org/T89001
[20:54:55] * qchris reads
[20:54:59] qchris: There is little that can be done retroactively except patch the LDAP entry by hand. Can do, if you need it; but please add to the phab task with the symptom
[20:55:38] k. I'll add it to the ticket, then ping you about manually adding the mail.
[21:00:06] Labs: Wikitech creates broken LDAP entry for new instances and users - https://phabricator.wikimedia.org/T89001#1044752 (coren)
[21:03:15] Labs: LDAP user without "mail" property got created - https://phabricator.wikimedia.org/T89760#1044759 (QChris) NEW
[21:14:24] RECOVERY - Host tools-exec-03 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[21:17:37] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<12.50%)
[21:18:31] Coren: I'm getting a 504 for my tool
[21:20:24] Labs: Wikitech creates broken LDAP entry for new instances and users - https://phabricator.wikimedia.org/T89001#1044814 (QChris)
[21:20:25] Labs: LDAP user without "mail" property got created - https://phabricator.wikimedia.org/T89760#1044811 (QChris) Open>Resolved a:QChris Coren added the mail property by hand for the user. Thanks!
[21:25:45] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
[21:27:06] RECOVERY - Host tools-exec-09 is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms
[21:27:38] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[21:30:48] Coren: task 7741127 isn't dying
[21:33:07] Betacommand: It "lived" on a server that is currently down. qdel -f did the trick.
[21:33:30] Coren: thanks
[21:34:14] RECOVERY - Host tools-webgrid-04 is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms
[21:34:33] Ah, and there it returns.
[21:35:04] RECOVERY - Host tools-exec-cyberbot is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[21:35:40] Tool-Labs: Java jobs stop working - https://phabricator.wikimedia.org/T88799#1044847 (dnaber) In about 90% the jobs are not running anymore. What's the correct behavior if the DB cannot be connected? Just crash and assume the process will automatically be restarted? Or manually try reconnecting after a wait p...
[21:41:27] RECOVERY - Host tools-webgrid-tomcat is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[22:31:43] (PS1) Legoktm: Send AutoWikiBrowser to #autowikibrowser [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/191199
[22:31:47] Reedy: ^
[22:31:55] (CR) Legoktm: [C: 2] Send AutoWikiBrowser to #autowikibrowser [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/191199 (owner: Legoktm)
[22:32:05] (Merged) jenkins-bot: Send AutoWikiBrowser to #autowikibrowser [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/191199 (owner: Legoktm)
[22:32:18] (CR) Reedy: "Thanks! :D" [labs/tools/wikibugs2] - https://gerrit.wikimedia.org/r/191199 (owner: Legoktm)
[22:32:53] !log tools.wikibugs Updated channels.yaml to: 8ed0a167e287b7c3374f8b9b7e556e9b4b6180d6 Send AutoWikiBrowser to #autowikibrowser
[22:32:58] Logged the message, Master
[23:10:42] andrewbogott, is everything supposed to be back to normal now?
[23:11:05] Krenair: yep
[23:12:55] btw andrewbogott, why do I still get referenced to labsconsole when logging into stuff?
[23:13:01] references*
[23:13:48] Krenair: can you be more specific?
[23:13:58] If you are having access problems, please see:https://labsconsole.wikimedia.org/wiki/Access#Accessing_public_and_private_instances
[23:14:39] hm, interesting, I will look
[23:15:14] Krenair: looks to me like that’s an instance with an obsolete puppet applied to it. what box?
[23:15:31] deployment-bastion
[23:15:47] yeah, that has its own puppet host.
[23:15:50] also, beta is broken because:
[23:15:50] Managed by… not me.
[23:15:51] krenair@deployment-db1:~$ sudo service mysql status
[23:15:51] * MySQL is not running, but PID file exists
[23:17:10] Krenair: That's not surprising; all the affected instances have crashed violently.
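The "not running, but PID file exists" status above is what a leftover PID file looks like after a crash: the file still names a process that is gone. Below is a minimal Python sketch of that kind of check, assuming a POSIX host where `os.kill(pid, 0)` probes a PID without sending a signal. The helper names and file names are hypothetical, and this is not claimed to be what the MySQL init script itself does.

```python
import os
import subprocess
import sys
import tempfile


def pid_file_is_stale(pid_path):
    """Return True if the PID recorded in pid_path has no live process."""
    with open(pid_path) as handle:
        pid = int(handle.read().strip())
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only probes the PID
    except ProcessLookupError:
        return True  # nothing is running under that PID any more
    except PermissionError:
        return False  # a process exists but belongs to another user
    return False


def write_pid_file(directory, name, pid):
    """Write pid to directory/name and return the resulting path."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write(f"{pid}\n")
    return path


with tempfile.TemporaryDirectory() as scratch:
    # A child that has already exited and been reaped leaves a dead PID behind.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    dead = write_pid_file(scratch, "dead.pid", child.pid)
    live = write_pid_file(scratch, "live.pid", os.getpid())
    stale_results = (pid_file_is_stale(dead), pid_file_is_stale(live))

print(stale_results)
```

The first entry covers the crashed-instance case (file present, process gone); the second covers a healthy service whose PID file still points at a running process.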
[23:17:23] Krenair: hm, probably a side-effect of an unexpected reboot on deployment-db1
[23:17:30] I started it
[23:17:34] I don’t know immediately how to cheer it up
[23:17:34] beta appears much less broken than normal
[23:17:46] * Coren chuckles.
[23:17:58] Less broken than normal when it crashes. :-)
[23:18:17] !log deployment-prep Started mysql on deployment-db1; beta now appears much less broken than before
[23:18:21] Logged the message, Master
[23:19:53] Coren: I would like to murder tools-webproxy-jessie and tools-webproxy-test since they were built using an obsolete image type. Can you assure me that they’re OK to kill, or do I need to wait for Yuvi?
[23:20:02] ? the api on beta has been returning "DB connection error" but now returns "The wiki is currently in read-only mode". progress?
[23:20:21] chrismcmahon, I turned on the DB server
[23:20:27] so it's less broken.
[23:20:35] that would account for it
[23:20:41] andrewbogott: I have confidence level 99% that they are useless; I wouldn't hesitate to shut them down.
[23:20:55] Isn't morebots supposed to log stuff to a page on wikitech?
[23:20:56] cool, thanks
[23:21:01] It does not appear to have worked.
[23:22:23] chrismcmahon, although I don't see it being read-only... details?
[23:22:29] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=User:Krenair&action=history
[23:23:28] I see it… https://wikitech.wikimedia.org/wiki/Labs_Server_Admin_Log
[23:23:28] Krenair: seems to be 100% back for me now also, thanks
[23:23:48] Krenair: ‘Started mysql on deployment-db1’ that, you mean?
[23:24:01] andrewbogott, aha, there it is
[23:24:41] didn't show on RC because labslogbot uses the bot flag
[23:27:04] PROBLEM - Host tools-webproxy-jessie is DOWN: CRITICAL - Host Unreachable (10.68.17.147)
[23:27:30] hm, useless but monitored
[23:27:59] That's automagical from the puppet manifest.
[23:28:12] Oh, ok.
[23:28:21] So, I don’t need to unregister?
[23:28:55] PROBLEM - Host tools-webproxy-test is DOWN: CRITICAL - Host Unreachable (10.68.16.113)
[23:31:16] I mean the monitoring was added automagically; it won't go away that easy. :-P
[23:34:16] OK, if nothing else is breaking at this exact minute, I may go and get some lunch.
[23:34:22] * andrewbogott counts to ten with crossed fingers
[23:36:20] I’ll send a proper outage email when I return.
[23:36:58] Coren, I wasn’t much good at delegating during this one but I nonetheless appreciate your support :)
[23:41:19] andrewbogott: Go lunch. I'm glad I could be of moral support if nothing else; at least I was able to baby the tools instances as they came back up.