[00:00:37] Happy UTC New Year
[00:15:01] 🎉
[00:20:16] Happy New Year, Dispenser
[00:23:28] Happy New Year to you too, Platonides
[02:38:31] my webservice is at "dr" state, it does not work right now and I can't delete or restart it. It's been like this for maybe 8 hours. What should I do?
[03:40:42] (PS1) Ricordisamoa: Use url_for() instead of bare relative URLs in templates [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261795
[04:09:39] (PS1) Ricordisamoa: Use the recommended boilerplate notice for the Apache License [labs/tools/translatemplate] - https://gerrit.wikimedia.org/r/261796
[04:10:42] (CR) Ricordisamoa: [C: 2 V: 2] Use the recommended boilerplate notice for the Apache License [labs/tools/translatemplate] - https://gerrit.wikimedia.org/r/261796 (owner: Ricordisamoa)
[04:27:19] Labs, Tool-Labs: tools-bastion-02 /dev/vda1 (root mount) 100% full - https://phabricator.wikimedia.org/T122713#1911921 (zhuyifei1999) NEW
[04:35:07] (PS1) Ricordisamoa: Update Bootstrap from 3.3.5 to 3.3.6 [labs/tools/translatemplate] - https://gerrit.wikimedia.org/r/261797
[04:35:46] (CR) Ricordisamoa: [C: 2 V: 2] Update Bootstrap from 3.3.5 to 3.3.6 [labs/tools/translatemplate] - https://gerrit.wikimedia.org/r/261797 (owner: Ricordisamoa)
[04:45:41] Labs, Tool-Labs: tools-bastion-02 /dev/vda1 (root mount) 100% full - https://phabricator.wikimedia.org/T122713#1911930 (zhuyifei1999) Removed files in /tmp belonging to my tool with ``` tools.yifeibot@tools-bastion-02:~$ for FILE in $(ls -l /tmp | grep yifeibot | awk '{print $(NF)}'); do rm -rv /tmp/$FILE...
[05:39:01] (PS1) Ricordisamoa: Switch to Python 3 [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261800
[05:41:31] (CR) Ricordisamoa: [C: 2 V: 2] Switch to Python 3 [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261800 (owner: Ricordisamoa)
[05:46:35] (PS2) Ricordisamoa: Use url_for() instead of bare relative URLs in templates [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261795
[05:47:24] (CR) Ricordisamoa: "PS2 is rebase only" [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261795 (owner: Ricordisamoa)
[06:00:03] (CR) Ricordisamoa: [C: 2 V: 2] Use url_for() instead of bare relative URLs in templates [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261795 (owner: Ricordisamoa)
[06:07:17] Hi
[06:07:37] root, including /tmp, is full on tools-dev aka tools-bastion-02
[06:22:08] liangent: I reported https://phabricator.wikimedia.org/T122713
[06:24:04] zhuyifei1999_: already found it myself
[06:24:18] k
[06:35:01] (PS1) Ricordisamoa: Clarify copyright statement, make space for more contributors [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261802
[06:36:11] (CR) Ricordisamoa: [C: 2 V: 2] Clarify copyright statement, make space for more contributors [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261802 (owner: Ricordisamoa)
[06:38:32] My webservice is at "dr" state. It does not work right now and I can't stop or restart it. It's been like this for nearly 12 hours. What should I do?
[06:49:40] (PS1) Ricordisamoa: functools.ttl_cache() does not exist, always use cachetools [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261803
[06:51:16] (CR) Ricordisamoa: [C: 2 V: 2] functools.ttl_cache() does not exist, always use cachetools [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261803 (owner: Ricordisamoa)
[09:17:32] (PS12) Ricordisamoa: Added a Wikidata-based "chart of the nuclides" under /nuclides [labs/tools/ptable] - https://gerrit.wikimedia.org/r/245591 (owner: ArthurPSmith)
[09:19:17] (CR) Ricordisamoa: "PS12 is shallow rebase only" [labs/tools/ptable] - https://gerrit.wikimedia.org/r/245591 (owner: ArthurPSmith)
[09:21:19] (PS13) Ricordisamoa: Added a Wikidata-based "chart of the nuclides" under /nuclides [labs/tools/ptable] - https://gerrit.wikimedia.org/r/245591 (owner: ArthurPSmith)
[09:22:32] (CR) Ricordisamoa: "PS13 standardizes copyright statements once and for all" [labs/tools/ptable] - https://gerrit.wikimedia.org/r/245591 (owner: ArthurPSmith)
[09:27:45] (PS14) Ricordisamoa: Added a Wikidata-based "chart of the nuclides" under /nuclides [labs/tools/ptable] - https://gerrit.wikimedia.org/r/245591 (owner: ArthurPSmith)
[09:28:32] (CR) Ricordisamoa: "PS14 switches to Python 3" [labs/tools/ptable] - https://gerrit.wikimedia.org/r/245591 (owner: ArthurPSmith)
[09:49:07] (PS1) Ricordisamoa: Fix ApiElementProvider under Python 3 [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261804
[09:50:27] (CR) Ricordisamoa: [C: 2 V: 2] Fix ApiElementProvider under Python 3 [labs/tools/ptable] - https://gerrit.wikimedia.org/r/261804 (owner: Ricordisamoa)
[09:53:14] (PS15) Ricordisamoa: Added a Wikidata-based "chart of the nuclides" under /nuclides [labs/tools/ptable] - https://gerrit.wikimedia.org/r/245591 (owner: ArthurPSmith)
[09:54:23] (CR) Ricordisamoa: "PS15 is rebase only" [labs/tools/ptable] - https://gerrit.wikimedia.org/r/245591 (owner: ArthurPSmith)
[11:12:30] Labs, Tool-Labs: tools-bastion-02 /dev/vda1 (root mount) 100% full - https://phabricator.wikimedia.org/T122713#1911995 (valhallasw) Seems to have been a transient issue: ``` /dev/vda1 18G 14G 3.7G 79% / ``` but graphite suggests something is...
[11:20:49] Labs, Tool-Labs: tools-bastion-02 /dev/vda1 (root mount) 100% full - https://phabricator.wikimedia.org/T122713#1911998 (valhallasw) Some `du` later, lots of space seems to be in: ``` valhallasw@tools-bastion-02:/var/log/account$ ls -lh total 5.2G -rw-r----- 1 root adm 852M Jan 1 11:18 pacct -rw-r----- 1...
[11:50:34] a few hours
[12:11:39] my tools return 404, web server seems up from qstat but webservice restart timeout trying to kill the server
[12:12:08] it runs on tools-webgrid-lighttpd-1201 which seems down, I can't ssh to
[12:21:08] i had a jlocal job scheduled at midnight utc but it seems it was ignored?
[12:27:35] valhallasw`cloud, can you get a look if tools-webgrid-lighttpd-1201 has trouble?
[12:28:27] phe: what's wrong?
[12:28:33] my tools return 404, web server seems up from qstat but webservice restart timeout trying to kill the server
[12:28:36] it runs on tools-webgrid-lighttpd-1201 which seems down, I can't ssh to
[12:28:56] hrm
[12:29:22] other tools on 1201 seems to freeze too
[12:30:17] Labs, Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1912006 (valhallasw) NEW
[12:39:33] Labs, Tool-Labs: tools-webgrid-lighttpd-1201 webservices and ssh unaccessible - https://phabricator.wikimedia.org/T122719#1912014 (valhallasw) root login also hangs: ``` debug1: Offering RSA public key: labs-root.id_rsa debug1: Server accepts key: pkalg ssh-rsa blen 277 debug1: Authentication succeeded (p...
[12:39:50] phe: tools should be back online now!
[12:44:15] yes, thanks
[12:49:16] phe: please create a task for it in phab when things like this happen; all the tool admins are subscribed to the #tool-labs project, so all of us get emails when a ticket is created
[12:49:32] ok
[12:49:49] thanks!
[18:02:08] Labs, Tool-Labs: tools-bastion-02 /dev/vda1 (root mount) 100% full - https://phabricator.wikimedia.org/T122713#1912198 (scfc) @valhallasw: Do you mean `ice/lab/`?
[18:07:44] Labs, Tool-Labs: tools-bastion-02 /dev/vda1 (root mount) 100% full - https://phabricator.wikimedia.org/T122713#1912202 (valhallasw) Sorry, yes, tools.icelab. The tool is basically running lots of small sql queries from bash, in parallel from different `screen`s. In general, I would say that's reasonable,...
[22:18:43] YuviPanda, andrewbogott: The nodepool Jenkins hosts seem to be awol again. rake-jessie jobs are backing up in zuul
[22:19:58] This happened on Wednesday too and was "fixed" either when andrewbogott restarted the nodepool service or when he restarted nova-compute
[22:35:56] any NFS issue again?
[22:38:40] I got errors like Could not open input file: /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php
[22:39:11] but that file is there, as seen from bastions and a random exec host I tried with qlogin
[22:40:00] liangent: any clue which host that happened on?
[22:40:37] which tool?
[22:40:44] oh, liangent-php. got it
[22:40:48] valhallasw`cloud: liangent-php
[22:40:57] I don't know how to find the host
[22:41:14] which task was it?
[22:41:24] Labs, Labs-Infrastructure, Release-Engineering-Team, operations, and 2 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1912279 (bd808) NEW
[22:41:27] valhallasw`cloud: php_cleanupCiteDates_zhwiki
[22:44:49] Labs, Tool-Labs, Patch-For-Review: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1912287 (valhallasw) Resolved>Open This seems to have happened again, this time with /data/project/liangent-php/mw/maintenance/cleanupCiteDates.php on tools-exec-1409. The behav...
[22:44:53] liangent: ^ this issue, I think.
[22:46:02] even though lookupcache /is/ disabled....
[22:47:31] valhallasw`cloud: did you do anything like the `ls` you mentioned?
[22:47:51] the same error is still occurring
[22:48:00] on which host?
[22:48:44] as in: did you try on tools-exec-1409, or did you resubmit the job?
[22:50:01] I resubmitted the job
[22:51:44] did it go to a different host?
[22:51:47] Labs, Tool-Labs, Patch-For-Review: Attribute cache issue with NFS on Trusty - https://phabricator.wikimedia.org/T106170#1912304 (valhallasw) And still happening on various other hosts: ``` tools.liangent-php@tools-bastion-01:~$ for i in {01..10}; do echo $i; ssh tools-exec-14$i wc -l /data/project/lian...
[22:51:49] yes
[22:51:49] ^
[22:52:49] I don't have an immediate solution for you, other than trying to run the script on a precise host instead
[22:53:17] valhallasw`cloud: should I issue ls to all those hosts, or leave it there for further investigation?
[22:53:52] and how can I pin it to those unaffected hosts in jsub
[22:54:16] I'm not sure if the ls will actually fix it, as you'd probably need to ls a whole lot of other dirs as well
[22:54:29] you can't; you can only pin to precise or trusty
[22:55:57] find -type d -exec ls \{\} \; ?
[22:59:17] liangent: Might work. I'm not sure if anyone will have time to look into it soon (definitely not this weekend), so if you need it to work right now, it's something you can try.
[22:59:28] but it might make figuring out what happened more difficult
[23:02:23] I'm running it on bastion right now (without a jsub)
[23:02:29] hope that you don't blame me for this
[23:02:44] that won't help
[23:02:54] it's not an issue on the nfs server, it's an issue in the nfs client
[23:03:03] so you need to run it on each exec host
[23:03:05] Labs, Continuous-Integration-Infrastructure, Labs-Infrastructure, Release-Engineering-Team, and 3 others: rake-jessie jobs stuck due to no ci-jessie-wikimedia slaves being attached to Jenkins - https://phabricator.wikimedia.org/T122731#1912305 (Unicornisaurous)
[23:03:19] valhallasw`cloud: I mean, I'm running the script on bastion
[23:03:45] ah! that's fine as long as it doesn't use excessive resources
[23:04:20] Is there a labsadmin with a bit free time? I need some help
[23:05:31] Luke081515: given the time of the year: probably not ;-) but your question might not require a labsadmin
[23:06:38] I guess it need one: I need to disable role::phabricator::labs at rcm-3, but every time I tried, the box was ticked after I clicked at "configure" again, so I can't disable it on my own :-/
[23:07:07] Probably best to file a ticket in phab for that
[23:07:14] ok
[23:09:24] Luke081515: do you have custom puppet groups setup for that project? Just a random thought but I've been bitten before by having the same role in the list twice and only unchecking one of the two boxes.
[23:09:50] bd808: I don't have custom puppet roles at all my projects
[23:12:23] the instance has currently only this role activated, but I can not deactivate it
[23:51:36] Labs, Labs-Infrastructure: Could not remove role::phabricator::labs from rcm-3 - https://phabricator.wikimedia.org/T122733#1912330 (Luke081515) NEW
[23:53:05] Labs, Labs-Infrastructure: Could not remove role::phabricator::labs from rcm-3 - https://phabricator.wikimedia.org/T122733#1912343 (Luke081515)
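Editor's note on the `find -type d -exec ls \{\} \;` suggestion at 22:55:57: the idea was to list every directory under the tool's home so that the NFS client on a host re-reads the directory entries. A minimal sketch of that idea follows. The function name `refresh_dir_listings` is hypothetical and does not appear in the log, and whether listing directories actually clears the stale cache from T106170 is an assumption; the log only says it "might work".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the log): list every directory below a root
# so that the local NFS client reads each directory's entries again.
# Assumption: this is only a possible workaround, not a confirmed fix.
refresh_dir_listings() {
    local root="$1"
    # The listing output is discarded; only the act of reading matters here.
    find "$root" -type d -exec ls -a {} + > /dev/null
    # Print how many directories were visited, including the root itself.
    find "$root" -type d | wc -l | tr -d ' '
}
```

As noted at 23:02:54, the problem is in the NFS client, so a helper like this would have to run on each affected exec host rather than on a bastion. The caveat from 22:59:28 also applies: doing so may make it harder to work out what happened.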