[00:12:09] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10DBA: tools.ptwikis throttled for abusing labs db replica resources - https://phabricator.wikimedia.org/T127228#2037315 (10scfc) (I don't like the word "abusing" here very much as it sounds to me as a English-as-a-foreign-language speaker that someone is knowingl... [00:59:31] Why is there a 5 hour cap on SQL scripts? And how do I raise it? [01:00:35] Dispenser: probably to avoid stuff similar to https://phabricator.wikimedia.org/T127228 [01:02:02] Can I do something like SLOW_OK (like on Toolserver) to keep them from being killed? [01:04:10] Dispenser: i dont know personally, but a ticket with the "DBA" tag would get replies [01:04:39] Dispenser: I would support that. [01:18:46] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Mainsane was created, changed by Mainsane link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Mainsane edit summary: Created page with "{{Tools Access Request |Justification=development |Completed=false |User Name=Mainsane }}" [01:41:59] Done: https://phabricator.wikimedia.org/T127266 [01:51:33] 10Tool-Labs-tools-Other, 6Community-Tech, 7Community-Wishlist-Survey, 7Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2037635 (10MusikAnimal) [01:51:35] 6Labs, 10Tool-Labs, 10DBA: Labs queries die after 5 hours - https://phabricator.wikimedia.org/T127266#2037644 (10Dispenser) 3NEW [02:29:48] 6Labs, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Soft mount remaining NFS mounts on deployment-prep - https://phabricator.wikimedia.org/T127224#2037726 (10yuvipanda) I've done this now for all things in deployment-prep that still need NFS. The mounts are soft, with a 10s timeout - if the NFS server t... 
[02:36:29] 6Labs, 10Beta-Cluster-Infrastructure, 5Patch-For-Review: Soft mount remaining NFS mounts on deployment-prep - https://phabricator.wikimedia.org/T127224#2037735 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I just tested this by removing export for one instance from the NFS server, and it handled it perfec... [03:26:01] 6Labs, 10Tool-Labs: puppet failure on a large number of instances - https://phabricator.wikimedia.org/T126165#2037896 (10yuvipanda) [03:26:03] 6Labs, 10Tool-Labs: tools-docker-registry-01 has incorrect puppetmaster key - https://phabricator.wikimedia.org/T126167#2037894 (10yuvipanda) 5Open>3Resolved Um, fixed for real now. [04:11:32] 6Labs, 10Labs-Infrastructure, 6Phabricator: can't log in to phab-01.eqiad.wmflabs - https://phabricator.wikimedia.org/T125666#2037976 (10mmodell) 5Open>3Resolved a:3mmodell [04:21:06] 6Labs, 10Tool-Labs, 6Project-Admins: Migrate Tools access request process to Phabricator - https://phabricator.wikimedia.org/T72625#2037997 (10mmodell) We can create forms now... [04:29:04] 6Labs, 10Labs-Infrastructure, 6Phabricator: can't log in to phab-01.eqiad.wmflabs - https://phabricator.wikimedia.org/T125666#2038015 (10Negative24) @mmodell What was the issue? Where you able to get into phab-02 as well? [04:40:12] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Mainsane was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=315047 edit summary: [04:59:21] 6Labs, 10Tool-Labs: provide a more strict robots.txt at Tool Labs - https://phabricator.wikimedia.org/T127206#2035680 (10scfc) We had a more restricted `robots.txt` in the past (March 2014; cf. T63132) and consensus for it. This changed in the following eleven months so when I committed d76c5f0a398b827f999fd1... [06:22:24] 6Labs, 10DBA: Some databases cannot be backed up/replicated on toolsdb - https://phabricator.wikimedia.org/T127164#2038050 (10jcrespo) 5Open>3Resolved Thank you all. 
Please allow me to help with backups if you do not have them for a faster recovery. [06:30:47] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10DBA: tools.ptwikis throttled for excessive usage of labs db replica resources - https://phabricator.wikimedia.org/T127228#2038054 (10jcrespo) [08:10:37] This tool has been down for a few days :/ https://tools.wmflabs.org/glamtools/glamorous.php [08:59:39] 10MediaWiki-extensions-OpenStackManager, 7I18n, 5Patch-For-Review: GENDER support in openstackmanager-addedto, openstackmanager-failedtoadd - https://phabricator.wikimedia.org/T99063#2038233 (10siebrand) 5Open>3Resolved [09:48:23] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [09:57:44] RECOVERY - Puppet staleness on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [3600.0] [10:23:21] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [10:36:51] 6Labs, 10Tool-Labs: provide a more strict robots.txt at Tool Labs - https://phabricator.wikimedia.org/T127206#2035680 (10Nemo_bis) > Many dynamic pages which take long time to load should be added. Do you have a representative sample of such URLs? [10:43:20] 6Labs, 10Tool-Labs: provide a more strict robots.txt at Tool Labs - https://phabricator.wikimedia.org/T127206#2038526 (10Nemo_bis) > See T127066: Bingbot scraping tools? for reason. That's not called a reason, it's called pre-emptive optimisation. The actual bug needs to be fixed here; pages getting visits is... [11:22:53] RECOVERY - Puppet failure on tools-docker-registry-01 is OK: OK: Less than 1.00% above the threshold [0.0] [11:42:34] (03CR) 10Siebrand: [C: 04-1] "As far as I can tell crosswatch doesn't support plural. 
A bug should be created that requests the i18n feature to be added before this cou" [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/265804 (owner: 10MtDu) [11:51:14] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Jaumeortola was created, changed by Jaumeortola link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Jaumeortola edit summary: Created page with "{{Tools Access Request |Justification=I will use Tools for the development of a this project (Wikimedia IEG grant): https://meta.wikimedia.org/wiki/Grants:IEG/Proofreading_sem..." [12:06:25] can someone restart meetbot? I have no permission. https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Tools/meetbot [12:07:32] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Jaumeortola was modified, changed by Jaumeortola link https://wikitech.wikimedia.org/w/index.php?diff=315443 edit summary: fix typo [12:22:23] (03PS1) 10Youni Verciti: Add the public_html folder to easily update the html code [labs/tools/vocabulary-index] - 10https://gerrit.wikimedia.org/r/271497 [12:43:19] (03CR) 10Hashar: [C: 04-1] "Please remove the compiled python files (*.pyc). No need for them in the Git repository and will definitely cause troubles." [labs/tools/vocabulary-index] - 10https://gerrit.wikimedia.org/r/271268 (owner: 10Youni Verciti) [13:01:47] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Arc was created, changed by Arc link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Arc edit summary: Created page with "{{Tools Access Request |Justification=I want to get involved with Wikimedia projects (I'm interested specially on Knowledge Engine) and help somehow. I' |Completed=false |User..." 
[13:03:51] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Arc was modified, changed by Arc link https://wikitech.wikimedia.org/w/index.php?diff=315502 edit summary: [13:17:28] 6Labs, 10Labs-Infrastructure, 6Operations: Make all ldap users have a sane shell (/bin/bash) - https://phabricator.wikimedia.org/T86668#973669 (10hashar) [13:17:43] (03PS1) 10Youni Verciti: Remove *.pyc [labs/tools/vocabulary-index] - 10https://gerrit.wikimedia.org/r/271509 [13:21:12] (03CR) 10Hashar: "Should be made to https://gerrit.wikimedia.org/r/#/c/271268/ instead." [labs/tools/vocabulary-index] - 10https://gerrit.wikimedia.org/r/271509 (owner: 10Youni Verciti) [13:24:46] !log deployment-prep upgrading elasticsearch to 1.7.5 on cirrus-browser-bot [13:24:46] Please !log in #wikimedia-releng for beta cluster SAL [13:24:50] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL, Master [13:25:38] !log search upgrading elasticsearch to 1.7.5 on cirrus-browser-bot [13:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Search/SAL, Master [13:34:55] (03PS2) 10Youni Verciti: Initial check-in [labs/tools/vocabulary-index] - 10https://gerrit.wikimedia.org/r/271268 [13:41:32] (03PS1) 10Siebrand: Add L10n-bot. [labs/tools/crosswatch] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/271513 [14:05:03] (03PS2) 10Siebrand: Localisation updates from https://translatewiki.net. [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/271517 (owner: 10L10n-bot) [14:06:12] (03CR) 10Siebrand: "Looks like L10n-bot didn't have submit rights yet. 
I've prepared that at Icb3fee0187a7cc9a1589bbb43ce73df41cf8c6f4" [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/271517 (owner: 10L10n-bot) [14:39:30] 6Labs, 10Labs-Infrastructure: Can not delete images as 'nodepoolmanager' on 'contintcloud' (nodepool account) - https://phabricator.wikimedia.org/T127310#2039458 (10hashar) [14:59:51] 6Labs, 10Labs-Infrastructure: Can not delete images as 'nodepoolmanager' on 'contintcloud' (nodepool account) - https://phabricator.wikimedia.org/T127310#2039560 (10Andrew) This was probably broken by https://gerrit.wikimedia.org/r/#/c/270781/ @hashar, any idea which of the rights in that file you need enabled? [15:09:32] 6Labs, 10Labs-Infrastructure: Can not delete images as 'nodepoolmanager' on 'contintcloud' (nodepool account) - https://phabricator.wikimedia.org/T127310#2039588 (10Andrew) or... actually it is probably this one: https://gerrit.wikimedia.org/r/#/c/270783/ [15:14:57] (03PS3) 10Youni Verciti: Initial check-in [labs/tools/vocabulary-index] - 10https://gerrit.wikimedia.org/r/271268 [15:17:13] 6Labs, 10Labs-Infrastructure: Can not delete images as 'nodepoolmanager' on 'contintcloud' (nodepool account) - https://phabricator.wikimedia.org/T127310#2039634 (10hashar) ``` $ openstack image delete ci-jessie-wikimedia-1455548346 ERROR: openstack 403 Forbidden: Access was denied to this resource. (HTTP 403)... [16:01:56] 6Labs, 10Tool-Labs: Cluebot writes massive logs that are making labstore run out of space - https://phabricator.wikimedia.org/T127222#2039876 (10chasemp) >>! In T127222#2036751, @DamianZaremba wrote: > I replied to the mailing list about this, in reply to the email about NFS running out of space. > > Cleared... [16:02:25] (03CR) 10Siebrand: [C: 031] Localisation updates from https://translatewiki.net. 
[labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/271517 (owner: 10L10n-bot) [16:02:31] (03CR) 10Siebrand: [C: 031] Update indentation for json [labs/tools/crosswatch] - 10https://gerrit.wikimedia.org/r/271518 (owner: 10L10n-bot) [16:12:47] 6Labs, 10Labs-Infrastructure, 10Incident-Labs-NFS-20151216, 6Operations, 10ops-eqiad: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#2039935 (10chasemp) [16:25:17] (03PS63) 10Ricordisamoa: Initial commit [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 [16:28:02] 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Can not delete images as 'nodepoolmanager' on 'contintcloud' (nodepool account) - https://phabricator.wikimedia.org/T127310#2039995 (10hashar) Fixed now. I have managed to upload and delete an image. Nodepool caught up with image deletions. [16:28:21] (03CR) 10Ricordisamoa: "PS63 supports the new identifier datatype with a new ExternalIdFormatter extended from StringFormatter; pending completion of https://phab" [labs/tools/wikidata-slicer] - 10https://gerrit.wikimedia.org/r/241296 (owner: 10Ricordisamoa) [16:34:15] andrewbogott: should we test https://gerrit.wikimedia.org/r/#/c/271171/ (lxc container upstart) somewhere? [16:34:43] bd808: I’m in a meeting, but how about if I merge it and then you tell me if it works or not? [16:34:54] sounds good [16:37:01] bd808: merged! You can puppet agent -tv and then reboot your test instance of choice. Sorry I’m not more attentive... [16:37:15] no worries. I'll check it out [16:41:26] andrew, bd808, can you help me? I tried to delete the job 3406366 (tool Luke081515bot), but now the job is deleting since more than 10 minutes [16:41:33] but still listed with qstat [16:43:49] Luke081515: my SGE skillz are not very advanced... [16:44:23] andrewbogott & yuvipanda: Can you help me? [16:45:33] Luke081515: is it gone now? 
[16:45:50] chasemp: Yeah [16:45:52] yeah it looks like it just ended finally [16:45:58] I force deleted [16:46:02] thanks [16:46:16] is there a way for normal user to do that too, or is this limited? [16:46:20] thanks chasemp [16:46:36] bd808: let's get you rights if you don't have them? [16:46:50] chasemp: I think I have them, I just don't know how to use them :) [16:47:11] I have sudo in the tools project [16:47:43] qdel -f [16:47:48] is like the kill -9 sge equiv [16:47:53] afaik [16:47:54] (03PS1) 10ArthurPSmith: Fixes for nuclides to handle problem of duplicate (and some wrong) returns from SPARQL query [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 [16:48:26] *nod* [16:50:34] (03CR) 10ArthurPSmith: [C: 031] "This will allow things to display even if we are getting bad data from the query service (at least in the half-life case). Hoping that pr" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 10ArthurPSmith) [16:58:26] 6Labs, 10MediaWiki-Vagrant, 5Patch-For-Review, 15User-bd808: mediawiki-vagrant on labs requires manual action after instance restart - https://phabricator.wikimedia.org/T127129#2040136 (10bd808) 5Open>3Resolved [16:59:00] (03CR) 10Ricordisamoa: "recheck" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 10ArthurPSmith) [17:01:26] (03CR) 10jenkins-bot: [V: 04-1] Fixes for nuclides to handle problem of duplicate (and some wrong) returns from SPARQL query [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 10ArthurPSmith) [17:02:43] (03CR) 10Ricordisamoa: [C: 04-1] Fixes for nuclides to handle problem of duplicate (and some wrong) returns from SPARQL query (032 comments) [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 10ArthurPSmith) [17:03:48] (03CR) 10Ricordisamoa: Fixes for nuclides to handle problem of duplicate (and some wrong) returns from SPARQL query (031 comment) [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 
10ArthurPSmith) [17:07:16] can someone restart meetbot? I have no permission. https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Tools/meetbot [17:11:14] jzerebecki: I added you to the tool :) [17:12:29] !log tools.meetbot Added JanZerebecki as co-maintainer [17:18:19] !log tools.meetbot Force killed stuck job [17:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.meetbot/SAL, Master [17:23:24] bd808: now we are both fiddling with it :) [17:23:35] jzerebecki: heh [17:23:52] I had to sudo qdel -f the running job to get it to shut down [17:24:01] I'll let you figure out how to get it started again [17:24:04] ok so i'll restart it now [17:25:09] from the error log it looked like the bot go mad because of an NFS hiccup [17:25:14] *got mad [17:29:23] !log tools.meetbot restarted to get the bot back [17:29:26] bd808: thx [17:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.meetbot/SAL, Master [17:29:58] jzerebecki: thanks for caring about the bot [17:37:28] reminder: In ~30 minutes all labs instances are going to start rebooting in a totally unpredictable sequence. [17:39:50] andrewbogott: :) https://www.youtube.com/watch?v=eXnvTwRBrgc [17:41:32] andrewbogott: Kernel update? [17:41:41] Luke081515: libc [17:41:57] So we don’t actually have to reboot them, just restart every single process :) [17:42:49] andrewbogott: Do you will reboot every single instance, or the while labs virts? [17:43:18] Luke081515: rebooting labvirtxxxx which has the (in this case, positive) side-effect of rebooting instances as well [17:43:49] andrewbogott: Can you ping me, let's say about 2 minutes before rebooting virt 1006? [17:44:44] Luke081515: chasemp is going to do the reboots. But, in general…. [17:45:05] chasemp: see my last comment, can you? ;) [17:45:16] it’s best to just be tolerant of reboots. 
Out of 800+ instances it’s hard to stage things very carefully [17:45:57] I don't have sth against them, but actually I'm loading content from my instance, so i want to now, at which time I have to load things at the cache ;) [17:55:03] hi andrewbogott [17:55:10] * andrewbogott waves [17:55:15] hi chasemp [17:55:40] I just applied the fixes to labvirt* so everything should be staged for reboots [17:56:09] hey guys, I'm looking into a hw issue with chris for a minute on labstore things (2 of them I guess) [17:56:15] will be few [17:56:54] ok [17:57:13] a 10-minute delay in a multi-hour process is not likely to make a difference :) [17:57:19] andrewbogott: all the instances that let me ssh in have recent libc [17:58:08] waiting on him to open up a box atm, what's the current state then? [17:58:18] we haven't rebooted anything but all updates should be staged for reboot [17:59:30] am making a list of all tools nodes in various hosts [17:59:41] lol [17:59:44] maybe I can reuse https://etherpad.wikimedia.org/p/tools-reboots-cve-0728 [18:02:21] chasemp: if you're looking for my 'depool node' and 'repool node' scripts, they're at ~/killnode.bash and ~/unkillnode.bash on tools [18:03:52] did you already start rebooting? My instance doesn't let me in: Permission denied (publickey). [18:04:08] sry, my fault [18:04:33] I tried to ssh from another host, not bastion [18:06:39] :D [18:10:26] (03CR) 10ArthurPSmith: "see comments - I will post a patch with fixed formatting and ignoring the warning shortly." (032 comments) [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 10ArthurPSmith) [18:10:50] ok yuvipanda can you walk me through how we have done this [18:10:59] do we drain all exec/webgrid on a particular labvirt then? 
[18:11:05] chasemp: yeah [18:11:18] chasemp: and do any other failovers as needed [18:11:21] chasemp: if you look at https://etherpad.wikimedia.org/p/tools-reboots-cve-0728 [18:11:27] the top has the failovers we do manually [18:11:41] although I don't failover the master at all usually, just clean stop it before the restart [18:12:24] chasemp: so I just xargs the list of exec and web nodes into killnode.bash [18:12:27] is depool node taking a labvirt as it's arg or an exec vm? [18:12:28] ah [18:12:34] it's an exec vm [18:12:45] (03PS2) 10ArthurPSmith: Fixes for nuclides to handle problem of duplicate (and some wrong) returns from SPARQL query [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 [18:12:47] there's no easy way to determine that from a labs vm (which instances are on which host) [18:12:54] chasemp: so the etherpad has list of them per virt host [18:14:04] ok so, let's do labvirt1001 one then first eh? [18:14:20] I'm waiting on chris to look at the drive sitch so I have a few to rolling [18:14:36] yuvipanda: mind doing dtach w/ me somewhere? [18:14:41] chasemp: yeah, moment [18:15:17] chasemp: how do I do it again? I have no working long term memory [18:15:23] apparently [18:15:34] where should be run cmd's from to start [18:15:57] chasemp: I run them on tools-login.wmflabs.org usually [18:17:46] (03CR) 10ArthurPSmith: [C: 031] "Ok, this passed pep8 and doesn't print - should be ok to merge?" 
[labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 10ArthurPSmith) [18:19:21] !log tools draining nodes from labvirt1001 [18:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [18:23:07] chasemp: andrewbogott ok to reboot labvirt1001 anytime now, I think [18:23:19] doing so [18:23:30] * andrewbogott is going to eat lunch but available in a pinch [18:23:33] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Arc was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=316272 edit summary: [18:23:38] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Jaumeortola was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=316273 edit summary: [18:23:45] chasemp: ok [18:25:27] PROBLEM - Host tools-puppetmaster-01 is DOWN: CRITICAL - Host Unreachable (10.68.22.61) [18:25:27] PROBLEM - Host tools-bastion-02 is DOWN: CRITICAL - Host Unreachable (10.68.16.44) [18:25:39] !log tools.stashbot Added #wikimedia-devtools to channel list [18:25:41] PROBLEM - Host tools-exec-1201 is DOWN: CRITICAL - Host Unreachable (10.68.17.49) [18:25:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stashbot/SAL, Master [18:25:44] hmm [18:25:48] let me kill shinken-wm [18:25:51] PROBLEM - Host tools-webgrid-lighttpd-1409 is DOWN: CRITICAL - Host Unreachable (10.68.18.43) [18:25:52] kk [18:26:02] (03CR) 10Ricordisamoa: Fixes for nuclides to handle problem of duplicate (and some wrong) returns from SPARQL query (031 comment) [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 10ArthurPSmith) [18:26:09] PROBLEM - Host tools-webgrid-generic-1405 is DOWN: CRITICAL - Host Unreachable (10.68.16.110) [18:26:10] (03CR) 10Ricordisamoa: "recheck" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/271551 (owner: 10ArthurPSmith) [18:27:32] chasemp: next would be 
labvirt1003, which has the proxy host. would you like to try failing it over? [18:27:57] chasemp: simply needs the IP address to be switched to the alternate, tools-proxy-02 [18:29:52] sure how to do it? [18:31:01] chasemp: wikitech has 'manage addresses' [18:31:02] cpettet@cair>ssh labvirt1001.eqiad.wmnet 'checkrestart' [18:31:02] Found 0 processes using old versions of upgraded files [18:31:07] yuvipanda: looking [18:31:20] http://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Admin has general docs [18:31:21] yuvipanda: What do I need to do to have a morebots in #countervandalism that does there with !log what !log cvn does here. [18:31:45] like the -releng variant [18:32:04] Krinkle: I've no idea about morebots :D you should ask one of the people listed as maintainers! [18:32:15] https://wikitech.wikimedia.org/wiki/User:Labslogbot [18:32:44] wtf [18:33:05] Krinkle: creating a new one is easy, make me a phab task and I’ll do it [18:33:10] It doens't run in toollabs afaik [18:33:14] andrewbogott: thx [18:33:17] andrewbogott: which project? [18:33:21] Krinkle: it does [18:33:24] Krinkle: but of course you’ll need to figure out where you want the log [18:33:35] Krinkle: um… operations is fine [18:33:36] Yep, the existing Nova project CVN/SAL is fine [18:34:05] We just want to deprecate our own bot. We currently have a bot there that logs to http://countervandalism.net/wiki/Developer_Log which we don't want to use anymore [18:36:49] Did you change sth at the webproxy config? I get networktime out, as I tried to connect to my instance [18:37:05] andrewbogott: labslogproxy is morebots, right? [18:42:14] andrewbogott: ping [18:42:29] andrewbogott: me and chasemp were wondering where the script that does nova starts is? [18:44:28] 6Labs, 10Tool-Labs: Cluebot writes massive logs that are making labstore run out of space - https://phabricator.wikimedia.org/T127222#2040934 (10DamianZaremba) Thanks @chasemp! I can see the logs are current linked to /dev/null. 
After a very quick look, it seems the MySQL connection is being a little un-relia... [18:45:48] 6Labs, 10Tool-Labs: Cluebot writes massive logs that are making labstore run out of space - https://phabricator.wikimedia.org/T127222#2040955 (10yuvipanda) The logs were also almost 100% 'undefined variable' and 'undefined index' php errors along with associated stack traces, with very little mysql ones. [18:45:54] yuvipanda: did someone change the config? I can't access any of my instances which are using webproxies [18:46:01] (access via web) [18:46:28] Luke081515: best to wait till the reboots are done, I think. the proxy host might be on one of the down instances [18:46:35] hm, ok [18:47:17] Luke081515: yup, I've manually started it [18:48:42] yuvipanda: sorry, back [18:49:03] There isn’t a script, I just dump the output of ‘nova list --all-tenants --host xxx’ into vi [18:49:07] and the s/r [18:49:10] then s/r [18:49:12] yuvipanda: Status switched: timeout => 500 => 200 :D [18:49:18] Luke081515: yeah :) [18:49:30] yuvipanda: that’s straightforward, right? [18:50:39] andrewbogott: am trying to make it a one liner [18:50:41] chasemp: ^ [18:50:50] Would the reboots going on be causing deployment-salt to be down? [18:50:57] I notice it is SHUTOFF [18:51:02] I’m sure that’s possible but exceeds my regexp-fu [18:51:06] ah [18:51:07] ok well [18:51:10] along with a bunch of other hosts including memc03 which is powering-off [18:51:11] I just did virsh --list -all [18:51:16] and am starting that list [18:51:24] although that should be the same list as nova tenants [18:51:34] chasemp: generally I capture a list of what’s running before the shutdown, so I don’t start instances that weren’t running beforehand [18:51:44] ok that we didn't do [18:51:48] should be ok tho? 
[18:51:49] chasemp: andrewbogott bah, am doing a start [18:51:56] this is all in my email :) [18:52:02] root@labcontrol1001:/home/yuvipanda# nova list --all-tenants --host=labvirt1001 --fields 'name,OS-EXT-SRV-ATTR:host,user_id,state' | grep labvirt1001 | awk '{ print $2;}' | xargs -L1 ./sleepandstart.bash [18:52:07] is what I'm running [18:52:10] that parsoid05 host was inaccessible earlier without being marked SHUTOFF :/ [18:52:19] yuvipanda: hang on a sec I think [18:52:28] but I think wikitech might just be slow with that [18:52:29] I'm already starting things let's see how it finishes and compare lists? [18:52:37] chasemp: ok, I ctrl-c'd [18:52:40] it started a few [18:53:01] chasemp: virsh sounds better since I suppose that'll list only running ones [18:54:11] ok yeah I think your command is better then [18:54:15] let's run that [18:54:17] and see how it conflicts [18:54:19] typically I’d vote for interacting with nova directly so we don’t get inconsistencies with the state in libvirt vs. the state in nova [18:54:23] although in theory nova re-syncs things [18:54:26] pretty sure virsh list all will start things that were stopped even [18:54:33] yeah [18:54:40] agreed [18:55:07] yuvipanda: restart your thing? 
[18:56:21] (03CR) 10Legoktm: [C: 032] Ignore comments posted to Phabricator by Stashbot [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/270989 (owner: 10BryanDavis) [18:56:34] (03Abandoned) 10Legoktm: Ignore "Stashbot" [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/270349 (owner: 10Legoktm) [18:56:55] (03Merged) 10jenkins-bot: Ignore comments posted to Phabricator by Stashbot [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/270989 (owner: 10BryanDavis) [18:57:28] chasemp: ok [18:57:39] chasemp: running [18:57:51] tx [18:58:07] !log tools.wikibugs legoktm: Deployed 8e84aa72b250ef18421c47ace7b4422e949c5837 Ignore comments posted to Phabricator by Stashbot wb2-irc [18:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master [19:00:14] 10Tool-Labs-tools-Other, 6Community-Tech, 7Community-Wishlist-Survey, 7Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2041047 (10DannyH) [19:02:03] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10DBA: tools.ptwikis throttled for excessive usage of labs db replica resources - https://phabricator.wikimedia.org/T127228#2041066 (10Crang115) Hi jcrespo, ptwikis is a wide project with many different small tools used by the portuguese community. Do you know whe... [19:05:19] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10DBA: tools.ptwikis throttled for excessive usage of labs db replica resources - https://phabricator.wikimedia.org/T127228#2041070 (10jcrespo) Yes, I can share the logs with you, but privately as I am not sure if it contains (your) private data. [19:07:55] yuvipanda: still running? [19:08:15] chasemp: yup [19:08:21] k [19:11:48] chasemp: done [19:11:58] chasemp: they all seem running [19:12:11] all look started but one [19:12:15] assuming that was shutdown prior [19:12:16] cool [19:12:33] spot checking seems good [19:12:53] chasemp: ok, let's repool! [19:13:40] running now [19:13:44] chasemp: ok! 
[19:14:47] (03PS3) 10Tim Landscheidt: Add options --help and --version to take [labs/toollabs] - 10https://gerrit.wikimedia.org/r/70058 (owner: 10Platonides) [19:14:49] (03PS2) 10Tim Landscheidt: Let take fail if recursion failed [labs/toollabs] - 10https://gerrit.wikimedia.org/r/268931 [19:14:51] (03PS2) 10Tim Landscheidt: Make take's FD constructor explicit [labs/toollabs] - 10https://gerrit.wikimedia.org/r/268934 [19:15:04] chasemp: andrewbogott I've done [19:15:06] nova list --all-tenants --host=labvirt1003 | grep Running | awk '{ print $2; }' > labvirt1003 [19:15:09] so I've a list of running ones [19:15:33] ok [19:15:38] I see nodes repooled on master [19:15:42] (03CR) 10jenkins-bot: [V: 04-1] Add options --help and --version to take [labs/toollabs] - 10https://gerrit.wikimedia.org/r/70058 (owner: 10Platonides) [19:15:43] on to draining 02? [19:15:53] (03CR) 10jenkins-bot: [V: 04-1] Let take fail if recursion failed [labs/toollabs] - 10https://gerrit.wikimedia.org/r/268931 (owner: 10Tim Landscheidt) [19:16:05] (03CR) 10jenkins-bot: [V: 04-1] Make take's FD constructor explicit [labs/toollabs] - 10https://gerrit.wikimedia.org/r/268934 (owner: 10Tim Landscheidt) [19:16:21] chasemp: andrewbogott asked us to do 02 lat [19:16:26] chasemp: *last [19:16:27] ok, 03? [19:16:29] so let's do that last [19:16:31] chasemp: yeah [19:16:34] going now [19:16:37] chasemp: can you do the unpooling? [19:16:40] ok [19:16:49] hm didn't work [19:16:57] invalid queue "*@" [19:17:13] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 10DBA: tools.ptwikis throttled for excessive usage of labs db replica resources - https://phabricator.wikimedia.org/T127228#2041097 (10jcrespo) I've sent @Crang115 a snapshot of the tools.ptwikis SHOW PROCESSLIST mysql account, filtered only by his own queries. No... [19:17:28] or not sure [19:17:37] chasemp: what did you run? 
[19:17:53] cat labvirt1003 | awk '{ print $2;}' | xargs -L1 ./killnode.bash [19:18:38] chasemp: yeah, labvirt1003 seems to be in a different format :D [19:18:46] no need for the awk [19:18:47] right :) [19:18:50] ok [19:19:14] a different format, really? [19:19:44] andrewbogott: I copy pasted them last time differently from this time [19:20:19] ok drained [19:20:30] chasemp: cool! [19:20:31] and proxy is moved off (I think) [19:20:33] so reboot? [19:20:35] chasemp: yeah [19:28:38] chasemp: I've updated etherpad to have info on the checker, so we can failover it back when doing labvirt1009 [19:29:29] k [19:29:34] yuvipanda: 1003 is back [19:29:36] run your thing? [19:31:02] chasemp: k [19:31:33] chasemp: doing [19:39:57] yuvipanda: I realized I don't know when your script is done :) [19:41:54] but I don't see it running so ok [19:46:27] !log tools repool labvirt1003 and depool labvirt1004 [19:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [19:55:48] I assume tools-cron-01 being a bit dead is due to reboots? [19:56:41] all is in limbo while updates happen, pretty good odds [19:57:43] (03PS1) 10Andrew Bogott: Add some fake certs to make the puppet compiler happy [labs/private] - 10https://gerrit.wikimedia.org/r/271595 [19:58:13] (03CR) 10Andrew Bogott: [C: 032 V: 032] Add some fake certs to make the puppet compiler happy [labs/private] - 10https://gerrit.wikimedia.org/r/271595 (owner: 10Andrew Bogott) [20:02:56] (03PS1) 10Andrew Bogott: One more fake key to satisfy the puppet compiler [labs/private] - 10https://gerrit.wikimedia.org/r/271598 [20:03:08] (03CR) 10Andrew Bogott: [C: 032 V: 032] One more fake key to satisfy the puppet compiler [labs/private] - 10https://gerrit.wikimedia.org/r/271598 (owner: 10Andrew Bogott) [20:26:40] chasemp: hey [20:26:44] sorry that was more than 20mins :| [20:26:47] chasemp: how did it go? 
[20:27:05] restarting vm's on labvirt1005 now, I think I took out a checker for a minute [20:27:09] but nothing else too amiss [20:28:07] chasemp: ok! checker's on 1009, so you probably took out the webhost running the tools *home* page [20:28:13] there wasn't much we could do about that maybe [20:28:54] can you prepare for taking out 1006 is anything is needed/ [20:28:55] ? [20:29:54] chasemp: yup am doing now [20:29:59] chasemp: there's static web [20:30:08] chasemp: am also moving proxy back to -01 so 1007 is good too [20:31:20] chasemp: tools-proxy-01 isn't up yet even though it was on labvirt1003 [20:31:40] hmmm [20:31:42] my python tool is down, is that expected? [20:31:44] thats no good [20:31:56] ok lets halt here to figure out the deal [20:31:56] Traceback (most recent call last): [20:31:57] File "/usr/local/bin/portreleaser", line 8, in [20:31:57] portgrabber.unregister() [20:31:58] File "/usr/local/lib/python2.7/dist-packages/portgrabber.py", line 44, in unregister [20:32:00] sock.connect((proxy, 8282)) [20:32:02] File "/usr/lib/python2.7/socket.py", line 224, in meth [20:32:04] return getattr(self._sock,name)(*args) [20:32:05] chasemp: ok [20:32:06] socket.error: [Errno 113] No route to host [20:32:28] yuvipanda: I'll run through starting things on 1003 as soon as 1005 is done just to get feedback [20:33:11] chasemp: so some of the instances on 1003 weren't started [20:33:20] my script exited prematurely [20:33:59] ebraminio: /topic, labs-l, etc. [20:34:08] ah [20:34:10] want to run that again to see? [20:34:25] ebraminio: everything has to be restarted because of a security issue [20:34:39] chasemp: am running it just for the shutdown ones [20:36:12] chasemp: all good now [20:37:02] ok I'm about to depool 1006 [20:37:03] !log tools failover proxy back to tools-proxy-01 [20:37:04] cool? 
[20:37:07] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master [20:37:19] chasemp: not yet, let me send a 'wall' on toolsbastion [20:37:26] chasemp: oh yeah, depool is good [20:38:00] done, giving it 3m [20:38:30] chasemp: ok [20:40:06] chasemp: ok, all good for 1006, 07, 08, 09
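The capture-then-restart approach discussed in this log (record which instances are running before a labvirt reboot, then act only on that list) can be sketched in shell. This is a minimal sketch under stated assumptions, not the scripts used in the log: `running_ids` and `for_each_id` are hypothetical helper names, and the input is assumed to be a pipe-delimited ASCII table with the instance ID in the first column, as the `awk '{ print $2; }'` in the log implies.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the log.

# Read a pipe-delimited ASCII table on stdin and print the ID column
# of every row that mentions "Running". Field 1 is the leading "|",
# so the ID is field 2.
running_ids() {
    grep Running | awk '{ print $2 }'
}

# Run a command once per line of a saved ID list.
# The list is already one ID per line, so no awk is applied here:
# `awk '{ print $2 }'` on a single-column file would print empty lines.
for_each_id() {
    local list_file=$1
    shift
    xargs -L1 "$@" < "$list_file"
}
```

With these helpers, the capture step from 19:15 would read `nova list --all-tenants --host=labvirt1003 | running_ids > labvirt1003`, and the later step would be `for_each_id labvirt1003 ./killnode.bash`. Keeping the column extraction in one place avoids applying `awk` a second time to the saved list, which appears to be what produced the `invalid queue "*@"` error at 19:16.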