[00:50:26] Wikimedia-Labs-General: Install flake8 on labs instances - https://phabricator.wikimedia.org/T90447#1061075 (scfc) Each Labs project installs its own software. What project do you mean?
[00:55:38] Labs: Investigate enabling host-based auth to all hosts from bastions - https://phabricator.wikimedia.org/T76971#1061077 (scfc) hiera would be nice because then individual Labs projects can disable HBA from bastions by setting up different/empty values at wikitech.
[01:06:13] Wikimedia-Labs-General: Install flake8 on labs instances - https://phabricator.wikimedia.org/T90447#1061102 (Mjbmr) Tools.
[03:25:54] Tool-Labs: Install flake8 - https://phabricator.wikimedia.org/T90447#1061219 (scfc) p:Triage>Normal
[03:26:56] Tool-Labs: Install flake8 - https://phabricator.wikimedia.org/T90447#1059043 (scfc)
[03:26:57] Tool-Labs, Tracking: Packages to be added to toollabs puppet - https://phabricator.wikimedia.org/T55704#1061223 (scfc)
[05:03:25] (CR) Mattflaschen: "Whoops, this was a duplicate of https://gerrit.wikimedia.org/r/#/c/182869/ ." [labs/tools/grrrit] - https://gerrit.wikimedia.org/r/188302 (owner: Mattflaschen)
[05:31:51] is labs dead again ?
[05:32:51] PROBLEM - Host tools-exec-cyberbot is DOWN: PING CRITICAL - Packet loss = 100%
[05:32:53] PROBLEM - Host tools-exec-09 is DOWN: PING CRITICAL - Packet loss = 100%
[05:33:10] yeah said as much
[05:33:16] PROBLEM - Host tools-exec-03 is DOWN: PING CRITICAL - Packet loss = 100%
[05:33:32] PROBLEM - Host tools-submit is DOWN: PING CRITICAL - Packet loss = 100%
[05:33:50] PROBLEM - Host tools-webgrid-tomcat is DOWN: PING CRITICAL - Packet loss = 100%
[05:34:06] PROBLEM - Host tools-webproxy-test is DOWN: PING CRITICAL - Packet loss = 100%
[05:34:21] PROBLEM - Host tools-webgrid-04 is DOWN: PING CRITICAL - Packet loss = 100%
[05:35:42] PROBLEM - Host tools-webproxy is DOWN: PING CRITICAL - Packet loss = 100%
[05:35:54] PROBLEM - Host tools-exec-07 is DOWN: PING CRITICAL - Packet loss = 100%
[05:48:55] YuviPanda hey.
[05:49:30] Sri_Designer: hey
[05:50:02] i sent you invites to LinkedIn... do you get them? ;)
[05:50:29] Sri_Designer: all of linkedin usually goes through to spam :)
[05:50:50] Sri_Designer: also, long time no see :)
[05:50:59] then why don't you linkedIn me : https://linkedin.com/in/mareklug
[05:51:06] indeed.
[05:51:14] let me share a great FREE book from Apple (in iBooks): https://itunes.apple.com/us/book/swift-programming-language/id881256329?mt=11 or as public pdf on Google Docs (my tinyurl): http://tinyurl.com/swiftbookapple
[05:52:02] I do not use linkedin, I think
[05:52:10] hmm, has your account been compromised? :)
[05:52:13] your pretty face is there
[05:52:43] no, i took a weeklong course on iOS developing, and this is a great book
[05:53:14] right.
[05:53:26] thank you for the link! I'll keep it in mind if I decide to do iOS dev at any point
[05:53:47] anyway, i think this is a neat all-around language. am reading the chapter on tail recursion in Swift, pervert that I am.
[05:54:13] i am evangelizing porting it to outside iOS world.
[05:55:05] btw, it is not just iOS, as it is replacing all Objective-C work other than OS maintenance on Mac OS X
[05:56:29] shinken-wm: why you silent?!
[05:56:35] did someone silence you?
[05:58:18] :)
[05:58:25] happy to see you here
[06:01:38] Hi GerardM-!
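For context on the flake8 request above (T90447 / T55704): until the package is added to the toollabs puppet manifests, a tool account can install it locally. A minimal sketch, not taken from the log; the virtualenv location and source path are illustrative:

    # run as the tool account on tools-login
    virtualenv ~/flake8-venv              # isolated Python environment in the tool's home
    ~/flake8-venv/bin/pip install flake8  # install flake8 for this tool only
    ~/flake8-venv/bin/flake8 ~/src/*.py   # lint the tool's Python sources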
[06:01:59] looks like virt1005 is dead
[06:05:35] YuviPanda I always considered PDF to be bloat itself, conditioned by Adobe productions, but this dude made 1080 of my screenfuls into a 3.9 MB file... just text and basic formatting, i guess.
[06:12:01] web server is down?
[06:12:47] Mjbmr: yes, labs outage in progress, I'm looking into it
[06:15:29] Labs, Tool-Labs, Beta-Cluster, operations: A virt host seems down, taking down all instances with it - https://phabricator.wikimedia.org/T90530#1061420 (yuvipanda) p:High>Unbreak!
[07:36:32] PROBLEM - Host tools-webproxy-01 is DOWN: CRITICAL - Host Unreachable (10.68.17.139)
[07:36:42] PROBLEM - Host tools-webproxy-02 is DOWN: CRITICAL - Host Unreachable (10.68.17.145)
[07:38:24] well done shinken-wm
[07:53:42] RECOVERY - Host tools-webproxy is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms
[08:01:24] PROBLEM - Host tools-webproxy is DOWN: CRITICAL - Host Unreachable (10.68.16.4)
[08:02:04] RECOVERY - Host tools-exec-03 is UP: PING OK - Packet loss = 0%, RTA = 0.82 ms
[08:03:41] RECOVERY - Host tools-webproxy is UP: PING OK - Packet loss = 0%, RTA = 0.58 ms
[08:03:49] RECOVERY - Host tools-exec-07 is UP: PING OK - Packet loss = 0%, RTA = 0.87 ms
[08:03:53] RECOVERY - Host tools-webgrid-tomcat is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
[08:03:55] RECOVERY - Host tools-webproxy-test is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms
[08:04:13] RECOVERY - Host tools-webgrid-04 is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[08:04:41] RECOVERY - Host tools-exec-09 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[08:05:03] RECOVERY - Host tools-exec-cyberbot is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[08:06:41] RECOVERY - Host tools-submit is UP: PING OK - Packet loss = 0%, RTA = 1.38 ms
[08:08:34] PROBLEM - Puppet failure on tools-exec-07 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[08:16:00] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [0.0]
[08:16:41] Labs, Tool-Labs, Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1061596 (yuvipanda) NEW
[08:19:05] Labs, Tool-Labs: Define expected SLA for tools - https://phabricator.wikimedia.org/T90535#1061606 (yuvipanda) NEW
[08:21:05] RECOVERY - Puppet failure on tools-webproxy-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:25:49] Tool-Labs-tools-Other: Script to produce #target list of users for input languages - https://phabricator.wikimedia.org/T90536#1061630 (Nemo_bis) NEW a:Nemo_bis
[08:28:35] RECOVERY - Puppet failure on tools-exec-07 is OK: OK: Less than 1.00% above the threshold [0.0]
[09:03:43] Tool-Labs-tools-Other: Commons Helper (on Tool Labs) failing on English Wikisource - https://phabricator.wikimedia.org/T90510#1061691 (Aklapper) Thanks for taking the time to report this! Is this about http://tools.wmflabs.org/commonshelper/ ? Exact links are always welcome.
[09:18:06] Labs, Tool-Labs: Define expected SLA for tools - https://phabricator.wikimedia.org/T90535#1061698 (yuvipanda)
[09:18:09] Labs, Tool-Labs, Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1061697 (yuvipanda)
[09:24:22] Labs, Tool-Labs, Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1061721 (yuvipanda) NEW
[09:25:13] Labs, Tool-Labs: Test and verify that OGE master/shadow failover works as expected - https://phabricator.wikimedia.org/T90546#1061746 (yuvipanda) NEW
[09:29:21] Labs, Wikimedia-Hackathon-2015: Labs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1061766 (yuvipanda)
[09:29:22] Labs, Tool-Labs, Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1061765 (yuvipanda)
[09:31:51] Tool-Labs: Install flake8 on Tool-Labs - https://phabricator.wikimedia.org/T90447#1061797 (Aklapper)
[09:33:18] Labs, Tool-Labs: Generic services nodes should be redundant so OGE can reschedule them onto another machine if one goes down - https://phabricator.wikimedia.org/T90557#1061836 (yuvipanda) NEW
[09:35:57] Merlissimo: hey! around?
[10:24:28] Labs, Tool-Labs, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1061952 (yuvipanda) NEW
[10:26:45] Labs, Tool-Labs, Tracking: Replace bigbrother and ssh-cron-thingy with service manifests - https://phabricator.wikimedia.org/T90561#1061959 (yuvipanda)
[10:31:46] Tool-Labs-tools-Other: Commons Helper (on Tool Labs) failing on English Wikisource - https://phabricator.wikimedia.org/T90510#1061962 (Peteforsyth)
[11:04:09] PROBLEM - Host tools-webproxy-jessie is DOWN: CRITICAL - Host Unreachable (10.68.17.147)
[11:50:38] can someone restart https://tools.wmflabs.org/quick-intersection/ ?
[11:51:43] jzerebecki: sure
[11:51:48] btw why is the error page empty now?
[11:51:57] before it was actually helpful
[11:52:31] jzerebecki: that’s just a result of today’s outage
[11:53:17] Labs, Wikimedia-Hackathon-2015: Labs web proxy should be load-balanced and tolerate the failure of virt host - https://phabricator.wikimedia.org/T89995#1062060 (yuvipanda) p:Low>High Toollabs was out twice over the last few days because of non-redundancy in tools-webproxy
[11:57:26] Labs, Staging: Increase Security Groups quota on Wikitech staging project - https://phabricator.wikimedia.org/T90473#1062066 (faidon)
[11:57:48] Labs, Tool-Labs, Beta-Cluster, operations: A virt host seems down, taking down all instances with it - https://phabricator.wikimedia.org/T90530#1062069 (yuvipanda) Open>Resolved a:yuvipanda Fixed now - the instances were all running but no network access. Restarting nova network was of no...
[11:58:24] YuviPanda: if users were to always complain about a certain tool being down, would it be ok for an admin of tools-labs to add a bigbrotherrc for it?
[11:58:31] Labs, Tool-Labs, Beta-Cluster, operations: Investigate and do incident report for strange virt1012 issues - https://phabricator.wikimedia.org/T90566#1062073 (yuvipanda) NEW a:yuvipanda
[11:58:39] jzerebecki: yup, I do that all the time.
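For reference, restarting a stopped webservice such as quick-intersection (requested above) is a short sequence for a maintainer; a hedged sketch assuming the 2015-era Tools webservice wrapper, with only the tool name taken from the log:

    become quick-intersection   # switch from your own shell account to the tool account
    webservice restart          # resubmit the tool's lighttpd job to the web grid
    qstat                       # confirm the webservice job is back in state 'r'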
[11:58:50] jzerebecki: if someone complains *once* I add it
[11:58:58] I never restart without adding one
[11:59:08] YuviPanda: want my help with stuff like that?
[11:59:29] jzerebecki: ooooh, are you volunteering to be a toollabs root? I’m collecting volunteers :)
[11:59:35] yes
[11:59:50] :D sweet! I’ll add you to the list and follow up?
[11:59:56] jzerebecki: do you have an NDA with the WMF signed?
[12:00:22] Labs, Tool-Labs, Beta-Cluster, operations: Investigate and do incident report for strange virt1012 issues - https://phabricator.wikimedia.org/T90566#1062073 (yuvipanda) a:yuvipanda>Andrew
[12:00:50] YuviPanda: yes, you can see by checking the ldap group
[12:00:59] jzerebecki: sweet! that simplifies things.
[12:01:05] bbl eating
[12:01:36] Labs, Tool-Labs, Beta-Cluster, operations: A virt host seems down, taking down all instances with it - https://phabricator.wikimedia.org/T90530#1062087 (yuvipanda)
[12:03:44] Labs: Wikitech creates broken LDAP entry for new instances and users - https://phabricator.wikimedia.org/T89001#1062109 (yuvipanda) p:Unbreak!>Normal (changing priority back to normal after a week of inactivity)
[12:06:55] Labs, Tool-Labs: Add more toollabs volunteer roots - https://phabricator.wikimedia.org/T90568#1062111 (yuvipanda) NEW
[12:08:28] Labs, Tool-Labs, Tracking: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1062122 (valhallasw) NEW
[12:09:51] Labs, Tool-Labs, Tracking: Add 'file a bug' link to tool labs error pages - https://phabricator.wikimedia.org/T90570#1062129 (valhallasw) NEW
[12:09:57] Labs, Tool-Labs, Tracking: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1062135 (yuvipanda) +1. I actively added this for all of Magnus' tools.
[12:10:15] Tool-Labs: Add 'file a bug' link to tool labs error pages - https://phabricator.wikimedia.org/T90570#1062129 (valhallasw)
[12:10:38] Labs, Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1062138 (valhallasw)
[12:12:00] valhallasw: have you seen https://phabricator.wikimedia.org/T90561
[12:25:15] Labs, Tool-Labs: Make webservice2 write out a bigbrotherrc file - https://phabricator.wikimedia.org/T90574#1062176 (yuvipanda) NEW
[12:25:39] valhallasw: jzerebecki https://phabricator.wikimedia.org/T90574 is probably also a great way to add more .bigbrotherrc files
[12:28:25] Labs, Tool-Labs: Provide webservice bigbrotherrc for actively used tools - https://phabricator.wikimedia.org/T90569#1062197 (yuvipanda) T90574 should make this easier.
[12:28:28] YuviPanda: yeah. don't care if it's bigbrother or manifest thingie
[12:28:49] valhallasw: yup, yup. for now bigbrother.
[12:28:54] valhallasw: stabilize first, replace later
[12:37:40] YuviPanda: do you need more help with those efforts ?
[12:49:19] matanya: right now? fixing https://phabricator.wikimedia.org/T90574 would be most helpful :D
[12:57:55] YuviPanda: where is that in the puppet repo ?
[12:58:08] matanya: find . -name 'webservice2'
[12:58:16] should be in modules/toollabs/files
[12:58:17] i think
[12:59:32] thanks
[13:50:15] Labs, Tool-Labs, Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1062341 (coren) For the record, this basically requires three things: (1) That the shadow master be on a different virt host than the grid master (that i...
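The .bigbrotherrc being discussed is just a file in the tool's home directory that the bigbrother watchdog reads; T90574 above proposes having webservice2 write it automatically. A hand-rolled sketch with 'mytool' as a placeholder name, and the exact line syntax hedged (bigbrother's accepted format may differ slightly):

    become mytool                           # placeholder tool name
    echo 'webservice' >> ~/.bigbrotherrc    # ask bigbrother to restart the webservice if it dies
    webservice start                        # bigbrother only restarts; bring it up once yourself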
[13:54:08] Labs, Tool-Labs, Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1062350 (yuvipanda) @coren: I've filed blocking tasks for things I think need to be done.
[13:56:31] hey Coren
[13:56:42] Hoy hoy.
[13:57:07] What was the outage?
[13:57:50] Coren: mystery
[13:58:23] Coren: virt1012 basically killed network connectivity of all instances on it
[13:58:23] Oh, ffs. virt1012? That's the one we evacuated 1005 to!
[13:58:26] and had to be restarted
[13:58:27] yesh
[13:58:36] so it was an almost identical outage
[13:59:55] But the networking going down like this is so... random!
[14:00:10] Coren: yup, yup
[14:04:41] Tools-submit is going to be tricky. Putting the crontabs themselves on the shared fs would be a first step, but we have to find a way to make sure that exactly one cron is running them at any one time.
[14:08:37] Coren: have you seen https://phabricator.wikimedia.org/T90534
[14:09:10] Yes, I'm catching up to phab now. Lots of activity tonight.
[14:17:37] Coren: cool. 40mins to start of NFS outage?
[14:19:07] Depending on the order in which Chris does things. I'll focus on labstore1001 when things go back up to add the shelf and restart, and once it's back up we'll do the rounds of instances to check that everything is up?
[14:19:31] Coren: sounds good.
[14:19:46] Thankfully, at the raid controller level it's just adding jbod.
[17:56:19] RECOVERY - SSH on tools-submit is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[17:58:03] !log tools rebooting *all* webgrid jobs on toollabs
[17:58:30] Logged the message, Master
[17:58:56] RECOVERY - SSH on tools-login is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[18:00:50] Labs, MediaWiki-extensions-OpenStackManager, Tool-Labs, Tool-Labs-tools-Article-request, and 9 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1062768 (Aklapper) a:Aklapper>None
[18:00:54] Tool-Labs: Decide how to handle gridengine on Debian Jessie - https://phabricator.wikimedia.org/T90369#1062773 (yuvipanda) p:Low>Normal Right. This also means that migrating tools to Jessie is a no-no in the near-to-mid term. I wonder what our alternative distributed job schedulers are.
[18:00:58] Labs, Tool-Labs: Define expected service level agreement for tools - https://phabricator.wikimedia.org/T90535#1062778 (Legoktm)
[18:01:01] Labs, MediaWiki-extensions-OpenStackManager, Tool-Labs, Tool-Labs-tools-Article-request, and 9 others: Labs' Phabricator tags overhaul - https://phabricator.wikimedia.org/T89270#1062785 (faidon) So, who's implementing that then?
[18:01:05] Labs, Puppet: Enable including classes via hiera for labs - https://phabricator.wikimedia.org/T90592#1062792 (yuvipanda) NEW
[18:01:14] Wikibugs: wikibugs test bug part II - https://phabricator.wikimedia.org/T90594#1062819 (valhallasw) NEW
[18:01:17] Tool-Labs: Have a 'undergoing scheduled maintenance' page for toollabs set up for scheduled maintenance - https://phabricator.wikimedia.org/T90595#1062829 (yuvipanda) NEW
[18:01:20] RECOVERY - SSH on tools-dev is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[18:05:03] PROBLEM - Host tools-webproxy-jessie is DOWN: CRITICAL - Host Unreachable (10.68.17.147)
[18:08:03] Fine time for my bouncer to get ill.
[18:08:54] Coren: hey
[18:09:04] Coren: grid engine is misbehaving, losing some jobs...
[18:09:06] 17:58‧ Things should be on their way back up; NFS is back.
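On the tools-submit concern above (crontabs moved to shared storage, but exactly one host executing them): one common pattern, offered here only as an illustration and not as what was decided, is to serialize each job with flock against a lockfile on the shared filesystem:

    # illustrative crontab entry; the tool path and script name are placeholders
    # flock -n exits immediately if another host already holds the lock, so even if
    # several submit hosts load the same crontab, at most one copy of the job runs
    */10 * * * * flock -n /data/project/mytool/cron.lock /data/project/mytool/update.sh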
[18:09:07] 17:58‧ I need to go make myself a coffee. Be back in 2 min
[18:09:09] and some are stuck in qw
[18:09:27] YuviPanda: Not all nodes have recovered yet I expect.
[18:10:06] Coren: admin is also still in qw
[18:10:09] mathoid is complaining as well... I got about 100 emails in the last minute that something is broken
[18:10:17] Ohhhh
[18:10:22] That's right.
[18:10:51] Yeah, the grid has rescheduled a lot of jobs and is currently waiting for load to go back down enough to start them.
[18:10:53] PROBLEM - Puppet failure on tools-webproxy-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[18:10:55] I was wondering why there was 6 different jobno for xtools...
[18:11:02] It may take another 10-15 minutes.
[18:11:15] * Coren prods.
[18:11:54] How do I kill a jobno that is state dr?
[18:11:59] Labs, hardware-requests, operations, ops-eqiad: virt1000 memory upgrade - https://phabricator.wikimedia.org/T89266#1063090 (Andrew) Open>Resolved Done, looks good.
[18:12:00] Labs, operations: OOM on virt1000 - https://phabricator.wikimedia.org/T88256#1063092 (Andrew)
[18:12:06] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[18:12:23] I tried qdel but it said already being deleted.
[18:12:31] T13|mobile: It's already on its way out, but the scheduler hasn't reaped it yet. I see pretty much every node is over the load limit as piled up things drain.
[18:13:04] labs-morebots: still there?
[18:13:05] I am a logbot running on tools-exec-13.
[18:13:06] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[18:13:06] To log a message, type !log .
[18:13:07] woo
[18:13:54] :D
[18:14:14] I guess it doesn’t use shared storage so was probably fine all along
[18:14:17] Labs, Continuous-Integration, Release-Engineering: Continuous integration should not depend on labs NFS - https://phabricator.wikimedia.org/T90610#1063123 (Krinkle) NEW
[18:14:21] Odd; tools-exec-03 has crashed.
[18:14:50] hmm, it is listed ACTIVE
[18:15:17] YuviPanda: Low prio question: What's the feasibility of having ssh working for jenkins>ci slaves running precise when nfs is unavailable? (For next time.)
[18:15:28] low prio, I think.
[18:15:42] trusty gets keys from LDAP, precise doesn’t, we could probably backport but not worth the effort atm
[18:15:47] might be other ways, though
[18:15:50] Krinkle: Should be fairly easy so long as the user doesn't have a home on NFS
[18:15:53] It's a tad annoying to have essentially no test or merge ability for several hours. Especially on a Tuesday
[18:16:18] Krinkle: in theory ssh should work for Trusty and Jessie instances.
[18:16:23] Did you find otherwise?
[18:16:38] Yeah, but most pipelines have at least one Precise pinned job.
[18:16:40] oh, sorry, Yuvi said that already
[18:16:47] Krinkle: ok, well… that’s the fix :/
[18:16:49] Which from an uptime perspective is the bottleneck
[18:17:00] so while trusty instances work. CI is effectively down
[18:17:14] Coren: strange result on http://tools.wmflabs.org/
[18:17:25] Krinkle: does that ssh work happen as a special user, or as whoever?
[18:17:26] Tool-Labs: Decide how to handle gridengine on Debian Jessie - https://phabricator.wikimedia.org/T90369#1063139 (scfc) Note though that Ubuntu Trusty will be supported upstream till [[https://wiki.ubuntu.com/Releases|April 2019]], so we have much time to come up with a long-lasting setup, so no need to rush.
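For the jobs sitting in 'qw' above: gridengine keeps jobs queued until an exec node's load drops below its limit, and qstat/qhost show why. A minimal sketch (the job ID is a placeholder):

    qstat -u '*'         # everyone's jobs and their states: qw (queued), r (running), dr (deleting)
    qstat -j <jobid>     # details for one job, including scheduling info on why it is still waiting
    qhost                # per-node load averages, to spot exec nodes over the load threshold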
[18:17:34] andrewbogott: jenkins-deploy
[18:17:35] YuviPanda: That *is* strange.
[18:17:36] You could puppetize a public key and store it off NFS.
[18:17:40] andrewbogott: Has its home on /mnt
[18:17:56] YuviPanda: I expect the redis keys are out of date. That will probably happen until all webservices have restarted.
[18:18:02] Does it need to use $home other than just to get the key?
[18:18:04] Coren: hmm, ok.
[18:18:24] Coren: heh, I restarted webservice, and now... view-source:http://tools.wmflabs.org/
[18:18:35] I'll just... wait.
[18:19:03] YuviPanda: Some of the webgrid nodes are still recovering, which is why so many servers are queued atm.
[18:19:09] yeah
[18:19:16] I shall just have a little more patience
[18:19:20] andrewbogott: It stores data in its home directory, but its home is not among the others. My personal ci labs home is in /home from nfs, but jenkins-slave user has its home on /mnt
[18:19:36] So it just needs the ssh key I guess
[18:19:45] Krinkle: yeah, that should do it. Should be straightforward.
[18:19:52] I don't know if ssh access is the reason it's broken
[18:20:00] Unless the key is /already/ not on NFS, in which case I don’t know what the story is
[18:20:17] something somewhere is causing CI not to work properly when labs NFS is unavailable.
[18:20:22] Krinkle: do you mind getting this down on the phab ticket? I’m in the middle of something
[18:20:32] thx
[18:20:36] YuviPanda: tools.admin has been started now, and has the http back. Not perfect yet though, not sure why.
[18:20:43] And I don't have the resources to track it down. I filed the bug as reminder for RelEng to triage :)
[18:20:53] It'll be back up when you guys are done :) No CI for now.
[18:21:43] YuviPanda: Might it be static that is down?
[18:22:00] Coren: admin is putting out plain PHP...
[18:22:13] don't think that's static being down?
[18:22:18] Yeah, I just figured that one out. That is downright bizarre.
[18:22:23] http://tools-static.wmflabs.org/static/ is definitely up
[18:23:05] YuviPanda: Ah, it's back to (more) happiness.
[18:23:21] slightly, yeah :)
[18:23:31] Tool list not working; I expect that's because tools-db is not yet back to health.
[18:23:45] What changed that ssh keys for Trusty do not depend on NFS? Is that back-portable?
[18:23:48] tools-db shouldn't have been affected at all, no?
[18:23:55] that's on bare metal
[18:24:02] Labs, Continuous-Integration, Release-Engineering: Continuous integration should not depend on labs NFS - https://phabricator.wikimedia.org/T90610#1063169 (Krinkle)
[18:24:08] Ah, true.
[18:24:20] Once upon a time, it wasn't.
[18:24:26] * Coren ponders.
[18:25:12] YuviPanda: The queues are back down to sane levels. About 20 errored out jobs or so
[18:25:35] * Coren will look into the tool list later.
[18:26:11] deployment-prep looks up.
[18:28:20] YuviPanda: Ah! Found the issue. The bot that keeps the tools table updated hasn't recovered yet.
[18:28:32] Coren: oh? I saw it as Rr
[18:29:32] Yeah, but I think it was broken by the NFS being gone and never got better.
[18:30:21] right
[18:31:01] Ah, also tools-submit is still ill.
[18:31:44] Ima give it a kick in the kernel.
[18:31:59] mind if I slink off to sleep now?
[18:32:12] Coren: ^
[18:32:27] YuviPanda: Go sleep. Things are reasonably back up to normal and what kinks are left I'll hunt and kill
[18:32:32] wheee
[18:32:34] thanks :)
[18:32:42] do log them all
[18:32:44] bye
[18:32:49] o/
[18:33:08] !log tools tools-submit not recovering well from outage, kicking it.
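On the "puppetize a public key and store it off NFS" suggestion above: one way to keep logins for a user like jenkins-deploy working through an NFS outage is to point sshd at a key file on local disk. A hedged sketch, not the change actually made on these instances; the paths and key are illustrative:

    # install the public key somewhere that is not NFS-backed
    sudo install -d -m 0755 /etc/ssh/userkeys
    echo 'ssh-rsa AAAA...example jenkins-deploy' | sudo tee /etc/ssh/userkeys/jenkins-deploy

    # have sshd consult that path as well as the (possibly unavailable) home directory
    echo 'AuthorizedKeysFile /etc/ssh/userkeys/%u .ssh/authorized_keys' | sudo tee -a /etc/ssh/sshd_config
    sudo service ssh restart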
[18:33:12] Logged the message, Master
[18:57:44] Oh, lol. I noticed I still had fastcgi.debug = 1 in lighttpd.conf in a tool of mine
[18:57:47] error.log is 8GB
[18:57:55] Sorry guys!
[19:02:39] And debug.log-request-handling = "enable" as well
[19:02:40] :S
[19:10:42] Labs, Tool-Labs, Beta-Cluster, operations: Investigate and do incident report for strange virt1012 issues - https://phabricator.wikimedia.org/T90566#1063364 (Andrew) Incident report is here: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150224-LabsOutage
[19:11:05] Krinkle: Heh. No harm done, but performance will have suffered a bit. :-)
[19:12:34] Coren: btw, 1) what's policy on access.log? Some are getting quite big. 2) I'd be interested in seeing traffic patterns. Some kind of aggregator of these (possibly done at the proxy level instead) to find popular tools, 404s, traffic spikes by tool etc.
[19:13:27] Krinkle: There isn't a policy, as we tend to keep a "don't place arbitrary limits unless there is an actual issue" philosophy.
[19:14:10] As for stats collection, it's been on the wishlist for some time but low enough in priority that nobody ever got around to it. It should be fairly easy to do that on the proxy indeed.
[19:20:40] PROBLEM - Free space - all mounts on tools-webproxy is CRITICAL: CRITICAL: tools.tools-webproxy.diskspace._var.byte_percentfree.value (<12.50%)
[19:27:15] James_F, Krinkle: I tried the "Demos" link from https://www.mediawiki.org/wiki/OOjs_UI/About_the_library, but the oojs-ui tool labs project is serving 404s.
[19:28:42] anomie: Yeah, NFS upgrade for Labs broke a lot of things.
[19:29:26] 404s shouldn't be a symptom
[19:29:36] Because it means the service is up
[19:29:40] anomie: Once the upgrade/maintenance is over the tool will need to get kicked by a user, I guess?
[19:29:50] Hmm. https://tools.wmflabs.org/oojs-ui/ also 404s.
[19:29:57] I docroot pointed at NFS?
[19:30:03] Is, even.
[19:30:17] James_F: Probably (it is by default) but NFS has been back for some time
[19:30:17] Maintenance done?
[19:30:20] Anyway, Coren has bigger fish to fry right now. :-)
[19:30:42] Oh, the maintenance is done? /topic doesn't say.
[19:30:43] RECOVERY - Free space - all mounts on tools-webproxy is OK: OK: All targets OK
[19:30:52] An email went out to labs-l about 47 minutes ago saying the maintenance was done
[19:30:53] James_F: No, I'm actually in "frying the small fish that got sick" mode.
[19:31:29] Coren: Ha. :-) In that case, I'd ask for help but I have no idea if I even have access to Rilke's tool.
[19:31:30] anomie: kicking the service now
[19:31:55] Hm. Whatever bug there may be is unrelated to the outage. Even a fresh start still has it 404
[19:32:18] Coren: Works now.
[19:32:24] https://tools.wmflabs.org/oojs-ui/oojs-ui/demos/index.html#widgets-mediawiki-vector-ltr
[19:32:26] Thanks!
[19:32:42] James_F: Ah! I was confused by the fact that there seems to be no root document.
[19:32:58] Coren: Does it use magic? https://tools.wmflabs.org/oojs-ui/
[19:33:28] Well, there's no rule that says you have to have something at / in your tool.
[19:33:29] out of curiosity, wouldn't it be possible to just restart everything that was previously running?
[19:34:09] pajz: It's /possible/ but not always a good idea - something that keeps state might be broken by a restart that was unnecessary
[19:34:34] pajz: And it's not possible to determine that without knowing what the normal runtime behaviour is supposed to be.
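For the fastcgi.debug leftovers above: turning the debug directives back off and emptying the log is all that's needed; a hedged sketch run as the tool account, assuming the usual per-tool ~/.lighttpd.conf and ~/error.log locations:

    # disable the two debug settings that were left on
    sed -i -e 's/^fastcgi\.debug *= *1/fastcgi.debug = 0/' \
           -e 's/^debug\.log-request-handling.*/debug.log-request-handling = "disable"/' \
           ~/.lighttpd.conf
    truncate -s 0 ~/error.log   # reclaim the 8 GB without removing the open file
    webservice restart          # restart lighttpd so it rereads the configuration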
[19:34:52] k
[19:35:16] Labs, Tool-Labs: Make webservice2 write out a bigbrotherrc file - https://phabricator.wikimedia.org/T90574#1063508 (scfc) (BTW, we should get rid of `webservice2` and merge it into `webservice` (with symlink from `webservice2`). Having two scripts with nearly identical names, but different abilities is n...
[19:36:06] Coren: Can we list tools as stateless so restarts are fine/encouraged?
[19:36:33] James_F: It'd make sense to.
[19:36:37] * James_F nods.
[19:36:41] Anyway, for another time.
[19:36:42] Coren: how can I qdel all jobs listed in qstat in one shot?
[19:37:09] T13|mobile: Well, presuming they're all yours, 'qdel -u username' should work IIRC
[19:37:32] Indeed, I think this should also work: qdel '*'
[19:37:38] (note the quotes)
[19:38:42] Perfect
[19:48:19] Labs, Tool-Labs, Tracking: Make sure that toollabs can function fully even with one virt* host fully down - https://phabricator.wikimedia.org/T90542#1063556 (scfc) As I wrote on T89995, I don't think this is feasible. Tools running in the Tools project would require the whole foundation of networking...
[19:56:17] Labs, Tool-Labs: Define expected service level agreement for tools - https://phabricator.wikimedia.org/T90535#1063613 (scfc) Does this task refer to the Tools project or the individual tools running there?
[20:25:14] Labs, Tool-Labs, Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1063729 (Betacommand)
[20:25:15] Tool-Labs: enwiki database dumps missing - https://phabricator.wikimedia.org/T89537#1063730 (Betacommand)
[20:28:29] Labs, Tool-Labs, Tracking: Make toollabs reliable enough (Tracking) - https://phabricator.wikimedia.org/T90534#1063734 (Betacommand)
[20:28:30] Wikimedia-Labs-Infrastructure: Create -latest alias for dumps - https://phabricator.wikimedia.org/T47646#1063735 (Betacommand)
[20:36:07] Coren: what should I file UA blocking bugs under?
[20:36:24] Betacommand: For tools or the general proxy?
[20:37:11] tools specifically, but probably applies to all of labs
[20:38:29] Tool-labs for tools normally; though you may want Labs if you want something more general
[20:39:08] just found an SEO crawler hitting a lot
[20:39:46] * Coren grumbles at people who do not follow RFC 3514
[20:41:00] Tool-Labs: Block xovibot user-agent - https://phabricator.wikimedia.org/T90636#1063770 (Betacommand) NEW
[20:53:57] why can't I delete a job from the grid? it says it's marked for deletion but still is in the list.
[21:10:25] Tool-Labs-tools-Other: Commons Helper (on Tool Labs) failing on English Wikisource - https://phabricator.wikimedia.org/T90510#1063850 (Peteforsyth) >>! In T90510#1061691, @Aklapper wrote: > Is this about http://tools.wmflabs.org/commonshelper/ ? Exact links are always welcome. Yes, it is -- sorry for the ov...
[21:12:46] Mjbmr: try using -f\
[21:12:51] Mjbmr: try using -f
[21:13:27] thanks
[21:13:52] Mjbmr: sometimes jobs get stuck
[21:14:22] Betacommand: for the most part, that only happens when there is a major outage.
[21:14:36] (Planned or not)
[21:14:46] it just put some dots, but it's still in the list.
[21:15:03] Coren: correct, I've seen a few oddball other cases
[21:15:08] Mjbmr: It may need me to kill it forcibly.
[21:15:09] qdel -f 8406804
[21:15:54] it was spawned by webservice2 uwsgi-python start
[21:16:11] do you know how I can enable multithreading for uwsgi btw?
[21:16:19] Ah! The issue is simpler than this - the job is stuck because the host it runs on is.
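A quick recap of the qdel variants used above, with a placeholder job ID:

    qdel -u $USER     # delete every job owned by the current (tool) account
    qdel '*'          # same effect; the quotes keep the shell from expanding *
    qdel -f <jobid>   # force deletion of a job stuck in 'dr' because its exec host is unresponsive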
[21:16:30] So gridengine never gets confirmation that it died.
[21:17:15] can you kill it?
[21:17:16] * Coren kicks it in the diodes.
[21:17:32] Mjbmr: he is working on it
[21:17:50] just don't hit the hardware please
[21:18:03] Mjbmr: sometimes that is needed
[21:19:08] Mjbmr: It's dead Jim. That said, part of the issue is that your script ignores SIGTERM which the grid tried to send when things went pear shaped
[21:20:00] I loaded pywikibot in it, it's multithreading but --enable-threads is not added to /usr/local/bin/tool-uwsgi
[21:20:31] I don't know, pywikibot doesn't accept ctrl+c
[21:21:25] That sounds like a bug to me.
[21:21:51] Mjbmr: are you using core?
[21:22:16] yeah, I can't submit patches for pywikibot, people are not nice to me, why they don't merge my patches?
[21:22:33] so must either craft tools from the start or use php.
[21:23:03] Mjbmr: core devs have taken a stance and won't listen to others
[21:23:19] Mjbmr: I use compat and it works well for me
[21:23:19] wat o_O
[21:23:42] Mjbmr: core went live with a crapton of issues
[21:24:01] ok, can I start a new version of pywikibot?
[21:24:19] I have yet to get core working, within about 5 minutes I always run into a fatal error
[21:24:59] Mjbmr: What I do is use an SVN checkout of compat which I merge into my personal SVN repo
[21:25:08] works well
[21:25:32] would you help on a new version?
[21:25:43] once in a great while I have to resolve a few conflicts
[21:26:13] Mjbmr: just create a fork of the older compat checkout, you shouldn't have many issues
[21:26:58] I've been working with trunk since 2010 but core issues are known to me.
[21:28:48] or you're right.
[21:29:31] Mjbmr: what I do is have a local svn checkout of compat where I do all my coding, and a batch file that copies the files to second directory that I use for submitting to my SVN fork repo
[21:30:01] Not the prettiest setup, but it works
[21:30:31] compat also has very low coverage on wikibase.
[21:31:05] correct
[21:31:05] do I get a new branch tho? a repos? or should I just start with github?
[21:31:30] I would just use github or whatever repo of your choice
[21:31:39] ok
[21:32:25] thanks, I'll let you know any update on that.
[21:32:28] I actually use the github version of pywiki so I can still use the SVN interface
[21:32:54] I got used to it. the git.
[21:41:15] Betacommand: https://www.mediawiki.org/wiki/Git/New_repositories/Requests see the bottom
[21:42:34] Mjbmr: you shouldn't use a WMF fork
[21:43:03] ok, I understand.
[21:50:48] Betacommand: pm?
[21:51:01] Mjbmr: sure
[23:16:08] !log tools.wikibugs legoktm: Deployed 6da78462504cd023e0c31babb5cc56a7eae3a88a Merge "Use brown instead of red for orange (=release) projects" wb2-phab
[23:16:14] Logged the message, Master
[23:16:24] !log tools.wikibugs legoktm: Deployed 6da78462504cd023e0c31babb5cc56a7eae3a88a Merge "Use brown instead of red for orange (=release) projects" wb2-irc
[23:16:27] Logged the message, Master
[23:18:06] wikibugs: y u no like -releng?
[23:47:20] Labs: virt1000 SPOF - https://phabricator.wikimedia.org/T90625#1064307 (chasemp)
[23:54:02] Labs, operations: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1064379 (chasemp)
[23:55:08] Labs, operations: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1064224 (chasemp) @andrew or @robh can we change that email on the signup page to operations@phab.wm.o? and https://phabricator.wikimedia.org/T55793#555374.
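On the uwsgi multithreading question above: the flag Mjbmr wants is uwsgi's --enable-threads; whether the webservice2 uwsgi-python wrapper can pass it through is left unresolved in the log. Purely as an illustration of the flag itself, a standalone invocation where the port, WSGI module, and process count are placeholders:

    # --enable-threads lets the Python application start and run its own threads,
    # which uwsgi otherwise disables by default under the GIL-holding worker model
    uwsgi --http :8000 --wsgi-file app.py --master --processes 2 --enable-threads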
[23:55:16] Labs, operations: Wikitech registration for prior SVN user - https://phabricator.wikimedia.org/T90658#1064385 (chasemp) p:Triage>Normal