[01:02:11] Labs, wikitech.wikimedia.org: Section link template/module broken - https://phabricator.wikimedia.org/T107726#1502084 (Sitic) NEW
[02:03:48] is it not possible to move lighttpd's logs?
[03:16:47] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 33.33% of data above the critical threshold [0.0]
[04:15:34] Labs, Labs-Infrastructure: Switch to a multi_host nova network using both labnet1001 and labnet1002 - https://phabricator.wikimedia.org/T107731#1502236 (Andrew) NEW a:Andrew
[04:21:47] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[04:22:06] Labs, Tool-Labs: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1502254 (Andrew) It is breaking updates, yes. You say that the problem is 'This uses the standard space/inode ratio'. Is the standard ratio 'barely any inodes'? Or is cdnjs absurdly file-heavy?
[05:14:23] Labs, Tool-Labs: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1502295 (yuvipanda) So they are currently being served out of tools-web-static-xx or something - which have bigger disks :) but yes cdnjs is crazily inode heavy. Is this problem in the old nodes whic...
[06:02:24] Tool-Labs-tools-Database-Queries, Phabricator: Archive Tool-Labs-tools-Database-Queries project - https://phabricator.wikimedia.org/T107699#1502332 (Bugreporter) However I don't think we should use Phabricator for this purpose (there's no way to transclude information of all requests in one page). This s...
[08:47:46] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 55.56% of data above the critical threshold [0.0]
[09:27:46] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[10:07:32] PROBLEM - Puppet failure on tools-static-01 is CRITICAL 100.00% of data above the critical threshold [0.0]
[10:18:49] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 55.56% of data above the critical threshold [0.0]
[10:53:45] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[13:54:47] Tool-Labs-tools-Database-Queries, Phabricator: Archive Tool-Labs-tools-Database-Queries project - https://phabricator.wikimedia.org/T107699#1502903 (scfc) Displaying Phabricator query results in MediaWiki is covered by T90432. I personally prefer using Phabricator, as that was its USP: Instead of using v...
[14:02:05] Labs, Tool-Labs: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1502911 (scfc) (The standard space/inode ratio for ext4 is about 16384 bytes/inode.)
[14:06:33] Labs, Labs-Infrastructure: Switch to a multi_host nova network using both labnet1001 and labnet1002 - https://phabricator.wikimedia.org/T107731#1502939 (Andrew) ok, I'm digging in the nova-network code, and I see this: # NOTE(vish): if we are not multi_host pass to the network host...
[14:19:02] Labs, Labs-Infrastructure: Switch to a multi_host nova network using both labnet1001 and labnet1002 - https://phabricator.wikimedia.org/T107731#1502992 (Andrew) OK, new proposal, which I've sent to the openstack list for criticism: 1) Install nova-network on all compute nodes 2) nova-network delete (curr...
[14:43:08] Labs, Tool-Labs: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1503095 (valhallasw) It's somewhat inode heavy, but not even that crazily. Based on `du -aSb | cut -f1` in /srv/cdnjs/ajax/libs, I get the following: {F339918} where e.g. the leftmost bin counts fi...
[14:49:48] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 55.56% of data above the critical threshold [0.0]
[14:54:44] Labs, Tool-Labs: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1503125 (Andrew) So, shall we just raise the default inode count for labs_lvm::volume or have a hiera setting? Either way, patches are welcome :)
[15:07:42] Labs, Tool-Labs: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1503150 (scfc) The problem is only relevant for `tools-static-*`, so I wouldn't change the default for all instances. 4 kByte/inode looks alright to me; we can always recreate the instances in the f...
[15:24:48] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[16:18:06] Labs, Labs-Infrastructure: Switch to a multi_host nova network using both labnet1001 and labnet1002 - https://phabricator.wikimedia.org/T107731#1503356 (Andrew) ok, I've learned a bit more about transitioning between multiple networks. Barring additional specs, when creating a new instance, nova queries...
[16:34:43] Labs, Labs-Infrastructure: Switch to a multi_host nova network using both labnet1001 and labnet1002 - https://phabricator.wikimedia.org/T107731#1503425 (scfc) Currently intra-project network traffic isn't blocked by security groups & Co. Will this still be the case for traffic between an "old" and a "new...
[17:04:01] Labs, Labs-Infrastructure: Switch to a multi_host nova network using both labnet1001 and labnet1002 - https://phabricator.wikimedia.org/T107731#1503519 (Andrew) That's a good question -- I'm not sure. They might wind up isolated from one another.
[17:09:12] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1503529 (coren) 3b. If the interface maps poorly to the existing one, then it being flexible enough to provide a legacy interface that does.
[17:17:53] Labs, Labs-Infrastructure: Properly link tools on https://tools.wmflabs.org/ - https://phabricator.wikimedia.org/T107777#1503564 (Magnus) NEW
[17:23:28] andrewbogott - do we have API replication lags, too?
[17:25:08] or Coren?
[17:25:36] doctaxon: You mean, internal to prod?
[17:26:42] Labs, Labs-Sprint-107: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1503604 (coren) I'm not sure what extra details to add, though I'll list the exact commands shortly. I've tested that 1001, in its current state, can actually successfully export filesystems to labs i...
[17:26:53] Coren I mean API queries
[17:27:10] i mean it is MariaDB?
[17:27:13] Is it possible to connect to bastion on a port other than 22? I'm working with a new developer in India who is saying they can't ssh into my instance. But I don't think opening up another SSH port will help, since they are still connecting to bastion on port 22
[17:27:45] Labs, Labs-Infrastructure: Properly link tools on https://tools.wmflabs.org/ - https://phabricator.wikimedia.org/T107777#1503606 (scfc)
[17:27:46] Labs, Tool-Labs, Patch-For-Review: Make list.php not rely on portgranter - https://phabricator.wikimedia.org/T93197#1503607 (scfc)
[17:28:30] Coren - it's running again, thank you, whatever you have done
[17:28:37] maybe wikibugs could say what was done exactly?
[17:29:12] doctaxon: T'wasn't me. Can't take the credit for it.
[17:32:04] Labs, Analytics, Labs-Infrastructure, Labs-Sprint-108, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1503614 (yuvipanda)
[17:33:28] Labs, Tool-Labs, Labs-Sprint-108: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1503619 (Andrew)
[17:34:49] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 5 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1503623 (coren)
[17:35:16] Labs, Labs-Sprint-107, Labs-Sprint-108: Switch NFS server back to labstore1001 - https://phabricator.wikimedia.org/T107038#1503624 (coren)
[17:35:44] Labs, Labs-Sprint-107: Setup monitoring and reporting for disk space usage of each project on NFS - https://phabricator.wikimedia.org/T106476#1503625 (coren) a:coren
[17:36:05] Labs, Labs-Sprint-107, Labs-Sprint-108: Setup monitoring and reporting for disk space usage of each project on NFS - https://phabricator.wikimedia.org/T106476#1469909 (coren)
[17:37:21] Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1503633 (coren)
[17:37:21] Labs, Labs-Sprint-107, Patch-For-Review: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1503629 (coren) Open>Resolved Considered resolved since the reinstall is the validation (T107574)
[17:38:06] Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1450103 (coren)
[17:38:08] Labs, Labs-Sprint-107, Patch-For-Review: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1503634 (coren) Resolved>Open Blah. confused two tickets.
[17:38:54] Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1450103 (coren)
[17:38:56] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1503645 (coren) Open>Resolved T107574 contains the remaining todo
[17:44:04] Labs, Labs-Sprint-107, Labs-Sprint-108: allow routing between labs instances and public labs ips (done, document) - https://phabricator.wikimedia.org/T96924#1503679 (Andrew)
[17:44:09] Labs, Labs-Sprint-107, Labs-Sprint-108: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1503681 (yuvipanda)
[17:46:53] Labs, Labs-Sprint-108: Have checkpoint checks for all labs services (Tracking) - https://phabricator.wikimedia.org/T107058#1503685 (coren)
[17:47:29] Labs, Labs-Sprint-108: Simple method to have a per-project debian repository - https://phabricator.wikimedia.org/T104194#1503686 (yuvipanda)
[17:48:13] Labs, Labs-Sprint-107, Labs-Sprint-108, Patch-For-Review: Make continuous backups of NFS data to codfw - https://phabricator.wikimedia.org/T106474#1503688 (coren)
[17:49:03] Labs, operations, Labs-Sprint-107, Labs-Sprint-108, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1503690 (Andrew)
[17:49:38] andrewbogott: https://phabricator.wikimedia.org/T104857
[17:51:01] Labs, Labs-Sprint-105, Labs-Sprint-108: Archive NFS data for projects that no longer have NFS - https://phabricator.wikimedia.org/T104857#1503691 (Andrew)
[17:51:13] Labs, Labs-Sprint-105, Labs-Sprint-108: Archive NFS data for projects that no longer have NFS - https://phabricator.wikimedia.org/T104857#1503699 (Andrew) a:Andrew
[17:56:39] Labs, Tool-Labs, Labs-Sprint-108: Investigate the cause of the (apparently) spurious puppet failures on Tools - https://phabricator.wikimedia.org/T107782#1503709 (coren) NEW a:coren
[17:56:51] Labs: start-nfs script warning message is scary and wrong - https://phabricator.wikimedia.org/T101742#1503718 (yuvipanda)
[17:56:51] Labs: Labs team reliability goal for Q1 2015/16 - https://phabricator.wikimedia.org/T105720#1503717 (yuvipanda)
[17:57:30] Labs, Tool-Labs, Labs-Sprint-108: Investigate the cause of the (apparently) spurious puppet failures on Tools - https://phabricator.wikimedia.org/T107782#1503719 (yuvipanda) See labs-l thread about OOM. @scfc was looking into that as well.
[17:57:43] Labs, Tool-Labs, Labs-Sprint-108: Investigate the cause of puppet failures on Tools - https://phabricator.wikimedia.org/T107782#1503721 (yuvipanda)
[17:58:06] Labs, Tool-Labs, Labs-Sprint-108: Investigate the cause of puppet failures on Tools - https://phabricator.wikimedia.org/T107782#1503709 (yuvipanda) Also note that these are new puppet failures - the older ones were just noise from 'puppetmaster restarted to rotate logs' and always happened at the same...
[18:01:48] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1503730 (yuvipanda) @jcrespo can you spec out an ideal labsdb machine and we can see if we can find budget for it?
[18:09:10] Can someone kill a query I submitted on Quarry? It is taking forever and preventing other people's queries from running. I don't need it anymore
[18:09:18] huji: which one?
[18:09:24] it shouldn't prevent other queries
[18:09:33] oh I see
[18:09:37] 4625 to 4627
[18:09:48] oops
[18:09:48] looking
[18:09:50] I should've run it from bash, I was too lazy
[18:09:55] thanks YuviPanda
[18:10:13] the hosts are dead
[18:10:13] hmm
[18:10:21] huji: it shouldn't be your query shouldn't kill it
[18:10:25] while on it, please kill 4628 and 4629
[18:10:41] oh regardless, please kill them YuviPanda, they are not needed anymore and they are intensive queries
[18:13:52] YuviPanda: unrelated question: the terminal environment we have on tools is bash, right?
[18:14:10] huji: yup
[18:15:33] YuviPanda: so I created a .bashrc script but it doesn't seem to be working. basically I want to define a bunch of aliases.
[18:15:47] let me finish up and fix quarry first... :D
[18:15:52] and you think you need .bash_profile
[18:16:49] YuviPanda: let me clarify that it works when I log into tools. But after I "become" huji setting the .bashrc for that account doesn't work
[18:17:24] YuviPanda: but it can wait
[18:17:49] huji: it's not supposed to
[18:18:13] valhallasw`cloud: how so?
[18:18:48] valhallasw`cloud: isn't it true that when I 'become' the other account, a new home environment is created? the meaning of "~" changes once i run become
[18:18:54] but ~/.bashrc is never run ?
[18:21:09] huji: become opens a login shell, so .bash_profile is run, not .bashrc. See https://askubuntu.com/questions/376199/sudo-su-vs-sudo-i-vs-sudo-bin-bash-when-does-it-matter-which-is-used
[18:21:32] valhallasw`cloud: oh, got it. Thanks!
[18:24:10] huji: fixed now. it wasn't your query :)
[18:24:18] YuviPanda: thanks :)
[18:39:50] Labs, Analytics, Labs-Infrastructure, Labs-Sprint-108, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1503837 (yuvipanda) We'll need to have strict and clear guidelines to make sure we don't leak private data. P...
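A minimal sketch for the .bashrc question answered at 18:21:09: because `become` opens a login shell, bash reads ~/.bash_profile and skips ~/.bashrc. A common pattern, assuming the tool account does not already have a ~/.bash_profile with other contents, is to let that file source ~/.bashrc so the aliases are defined in both kinds of shell. The file contents below are illustrative and not taken from the log.

```shell
# ~/.bash_profile of the tool account (hypothetical contents).
# A login shell reads this file; pull in the aliases kept in ~/.bashrc.
if [ -f "$HOME/.bashrc" ]; then
    . "$HOME/.bashrc"
fi
```

Keeping the aliases in ~/.bashrc and only sourcing it here means interactive non-login shells started later pick up the same definitions.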
[18:41:28] Labs, Labs-Sprint-105, Labs-Sprint-108: Archive NFS data for projects that no longer have NFS - https://phabricator.wikimedia.org/T104857#1503871 (yuvipanda) So the new setup should be # Look at the nfs-mounts yaml file # Find all the volumes that are actually active # Find all folders that should be a...
[19:12:17] YuviPanda: so I can just delete tools-static-01 right now?
[19:12:44] andrewbogott: yup. tools-web-static-01 is what's serving
[19:12:57] ok, here goes...
[19:13:44] !log tools deleted tools-static-01
[19:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, dummy
[19:14:20] Labs, Tool-Labs, Labs-Sprint-108: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1504042 (Andrew) ok, so I deleted tools-static-01, I guess that means this can be closed?
[19:16:59] PROBLEM - Host tools-static-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.20)
[19:17:15] it's ok shinken, it's ok
[19:23:34] Labs, Tool-Labs, Labs-Sprint-108: /srv on tools-static-01 running out of inodes - https://phabricator.wikimedia.org/T107657#1504103 (valhallasw) The newer hosts have a 60GB partition instead of a 20GB one. For reference: * tools-static-01 had a 21G cdnjs drive with 1403520 inodes * tools-web-static...
[19:32:13] YuviPanda: I made a tool! https://tools.wmflabs.org/bash/
[19:33:57] bd808: not a phabricator plugin? ;-)
[19:34:08] valhallasw`cloud: too lazy for that
[19:34:39] it's an experiment in using Elasticsearch as a backend for things other than "normal" search
[19:39:06] bd808: cool :)
[19:39:57] bd808: I've been looking through it for like the past 30 minutes because I have nothing better to do right now :)
[19:40:05] that's how you know you've made a good tool
[19:40:09] heh. it's kind of fun
[19:40:16] and thanks
[19:45:21] bd808: nice! Where is the es running?
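A rough arithmetic check of the inode figures quoted in T107657 (16384 bytes/inode default, the proposed 4 kByte/inode, and 1403520 inodes on the 21G cdnjs drive). This is only a sketch: the variable names are made up for illustration, and it assumes "4 kByte" means 4096 bytes.

```shell
#!/bin/bash
# Figures quoted in T107657; variable names are illustrative.
default_ratio=16384    # ext4 default, bytes per inode
proposed_ratio=4096    # "4 kByte/inode", assumed to mean 4096 bytes
old_inodes=1403520     # inode count reported for the 21G cdnjs drive

# Volume size implied by the reported inode count at the default ratio.
implied_bytes=$((old_inodes * default_ratio))
implied_gib=$((implied_bytes / 1024 / 1024 / 1024))   # integer division

# Inodes the same volume would get at the proposed ratio.
inodes_at_proposed=$((implied_bytes / proposed_ratio))

echo "implied size: ${implied_bytes} bytes (~${implied_gib} GiB)"
echo "inodes at ${proposed_ratio} bytes/inode: ${inodes_at_proposed}"
```

The implied size matches the reported 21G drive, and a 4096 bytes/inode ratio would give the same volume four times as many inodes. With mke2fs the ratio is chosen at filesystem creation time via `-i bytes-per-inode`; whether labs_lvm::volume exposes that setting is not stated in the log.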
[19:45:43] YuviPanda: in the statshbot project in Labs
[19:45:49] *stashbot
[19:45:58] Ah nice
[19:46:12] next up will be a nice UI for the SAL messages I've been collecting
[19:46:33] Nice!
[20:15:46] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 20.00% of data above the critical threshold [0.0]
[20:16:53] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Xavier Combelle was created, changed by Xavier Combelle link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Xavier_Combelle edit summary: Created page with "{{Tools Access Request |Justification=create a tabular way to edit wikidata after an idea on the french village pump https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Le_Bistro/2_a..."
[20:30:47] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[20:56:04] (CR) BryanDavis: [C: 2] Private hiera keys for Sentry [labs/private] - https://gerrit.wikimedia.org/r/227927 (https://phabricator.wikimedia.org/T84956) (owner: Gergő Tisza)
[21:16:48] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 30.00% of data above the critical threshold [0.0]
[21:56:48] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[22:14:47] (CR) BryanDavis: [C: 1] "Apparently I don't have the gerrit rights to merge this" [labs/private] - https://gerrit.wikimedia.org/r/227927 (https://phabricator.wikimedia.org/T84956) (owner: Gergő Tisza)
[22:15:04] YuviPanda: https://gerrit.wikimedia.org/r/#/c/227927/ plz and thank you
[22:38:43] Labs, Tool-Labs, Design Research Backlog, Learning-and-Evaluation: Organize a (annual?) toollabs survey - https://phabricator.wikimedia.org/T95155#1504758 (aripstra)
[22:41:23] (PS3) Yuvipanda: Private hiera keys for Sentry [labs/private] - https://gerrit.wikimedia.org/r/227927 (https://phabricator.wikimedia.org/T84956) (owner: Gergő Tisza)
[22:41:44] (CR) Yuvipanda: [C: 2 V: 2] Private hiera keys for Sentry [labs/private] - https://gerrit.wikimedia.org/r/227927 (https://phabricator.wikimedia.org/T84956) (owner: Gergő Tisza)
[22:43:39] thanks YuviPanda
[23:02:06] Labs, Tool-Labs: Determine and deploy proper h_vmem resources for execution nodes - https://phabricator.wikimedia.org/T107665#1504876 (scfc) With the script in P1820 I got the following `h_vmem` sums per host as they are now (well, some hours ago): | Host | sum(h_vmem)/GBytes | tools-exec-1201 | 0.000000...
[23:06:02] Labs, wikitech.wikimedia.org: Fix another broken module import - https://phabricator.wikimedia.org/T107726#1504880 (Krenair)
[23:07:42] Labs, wikitech.wikimedia.org: Fix another broken module import - https://phabricator.wikimedia.org/T107726#1502084 (Krenair) Open>Resolved a:Krenair Fixed by running this in `mwscript eval.php labswiki` on silver: ```$user = User::newFromName( 'Alex Monk' ); $summary = "[[phabricator:T107726|scri...
[23:08:13] Labs, Wikimedia-Site-requests, wikitech.wikimedia.org: Fix another broken module import - https://phabricator.wikimedia.org/T107726#1504891 (Krenair)
[23:16:52] Labs, Wikimedia-Site-requests, wikitech.wikimedia.org: Fix another broken module import - https://phabricator.wikimedia.org/T107726#1504907 (Legoktm) Due to {T91170}, I forgot my script didn't run on silver, sorry.
[23:30:40] Labs, Tool-Labs: Determine and deploy proper h_vmem resources for execution nodes - https://phabricator.wikimedia.org/T107665#1504945 (scfc) After setting the `h_vmem` `complex_value` with `qconf -me $host`, I restarted the web service for `catscan2`, but to my non-understanding, it was rescheduled on `to...
[23:32:14] Labs, Tool-Labs: Determine and deploy proper h_vmem resources for execution nodes - https://phabricator.wikimedia.org/T107665#1504963 (scfc) I used `qconf -mc` to set `CONSUMABLE` for `h_vmem` to `YES`.
[23:40:09] Labs, Tool-Labs: Determine and deploy proper h_vmem resources for execution nodes - https://phabricator.wikimedia.org/T107665#1504976 (scfc) Now `qhost -F h_vmem` looks much more logical: ``` […] tools-webgrid-lighttpd-1401.eqiad.wmflabs lx26-amd64 4 0.07 7.8G 571.3M 488.0M 0.0 Host Re...
[23:40:50] Labs, Tool-Labs: Puppetize that h_vmem is a consumable resource - https://phabricator.wikimedia.org/T107821#1504986 (scfc) NEW