[00:39:33] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[00:42:58] could we get morebots restarted pls?
[00:58:42] 3Wikimedia-Labs-wikistats: Fix all the Wikia stats - https://phabricator.wikimedia.org/T61943#980983 (10Dzahn) actual cause: unknown table. exiting :p
[01:16:36] 3Wikimedia-Labs-wikistats: Fix all the Wikia stats - https://phabricator.wikimedia.org/T61943#980991 (10Dzahn) needed this fix: https://gerrit.wikimedia.org/r/#/c/185357/3 now updates are running (in screen) ..it will take a while but it works now
[01:17:25] 3Wikimedia-Labs-wikistats: Add 300 000 wikia wikis to stats table - https://phabricator.wikimedia.org/T38291#980998 (10Dzahn)
[01:17:27] 3Wikimedia-Labs-wikistats: Fix all the Wikia stats - https://phabricator.wikimedia.org/T61943#980996 (10Dzahn) 5Open>3Resolved http://wikistats.wmflabs.org/display.php?t=wi
[01:44:29] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[01:56:29] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[02:40:00] wikibugs: ping
[02:40:37] !log tools.wikibugs legoktm: Deployed c61edcfab64d62081edc3ccf89534764017f4a1c Make sure we're in the channel before messaging it wb2-phab
[02:40:52] !log tools.wikibugs legoktm: Deployed c61edcfab64d62081edc3ccf89534764017f4a1c Make sure we're in the channel before messaging it wb2-irc
[02:40:59] ircnotifier??
[02:41:14] hm.....
[02:41:33] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[04:17:32] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[04:42:31] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:18:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[05:28:05] could someone run https://wikitech.wikimedia.org/wiki/User:Yuvipanda/Restarting_magnus_wdq
[05:43:31] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[05:59:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:24:36] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[06:40:33] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[06:44:43] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[07:01:00] PROBLEM - Puppet failure on tools-exec-gift is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[07:05:31] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:14:39] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0]
[07:20:59] RECOVERY - Puppet failure on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[07:36:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[08:06:34] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[08:48:37] 3Tool-Labs: Provide page view metrics for individual tools on toollabs - https://phabricator.wikimedia.org/T87001#981290 (10yuvipanda) 3NEW
[09:54:02] 3Wikimedia-Labs-wikitech-interface, operations: Interwiki map broken on wikitech - https://phabricator.wikimedia.org/T43786#981362 (10jayvdb) 5Open>3Resolved a:3jayvdb Appears the local interwikis are now working and in the sites interwikimap.
[09:57:35] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[10:27:33] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[10:28:42] 3Tool-Labs-tools-Erwin's-tools: xwiki.php not working - https://phabricator.wikimedia.org/T86976#981436 (10Billinghurst) Toolserver (and its content) is gone. If it wasn't retrieved, it is gone. About the only hope that you have is that it is on the ToolLabs and has not been converted to work. Need to ping one...
[10:29:04] 3Labs-Team: New disk partition scheme for labs instances - https://phabricator.wikimedia.org/T87003#981437 (10yuvipanda) 3NEW
[10:29:41] 3Labs-Team: New disk partition scheme for labs instances - https://phabricator.wikimedia.org/T87003#981447 (10yuvipanda) Let's do this for the Jessie images to start with, and then make precise / trusty images like this too.
[10:31:02] 3Labs-Team: New disk partition scheme for labs instances - https://phabricator.wikimedia.org/T87003#981449 (10yuvipanda)
[10:41:21] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981469 (10akosiaris) Is there anything else left to do for this?
[10:57:51] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981490 (10yuvipanda) Still happening. shinken-server-01 can't access deployment-mediawiki02 for example.
[11:08:21] 3Wikimedia-Labs-Infrastructure, Continuous-Integration, Labs-Team: OpenStack API account to control `contintcloud` labs project - https://phabricator.wikimedia.org/T86170#981512 (10hashar) I have created a first draft of the architecture at https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isola...
[11:08:25] Coren: ping
[11:08:28] when around
[11:08:42] Coren: can you create a new queue called generic-webserver (let’s not call it tomcat anymore) and add the new server to it?
[11:23:21] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981533 (10akosiaris) After @Yuvipanda informed me that indeed the above merge did not solve the issue, I investigated deployment-mediawiki01 and deployment-me...
[11:38:28] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[11:45:11] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981567 (10yuvipanda) So I see that the problem is in all hosts that had base::firewall applied at some point and do not any more (integration and deployme...
[11:56:31] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981588 (10yuvipanda) So, to make the beta hosts pristine-ish again.. ```$ sudo su # rm -rf /etc/ferm; dpkg -P ferm; iptables -P INPUT ACCEPT; iptables -F...
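The command in that last comment is cut off in the log. A minimal sketch of the kind of per-host cleanup being described, assuming ferm was installed as a Debian package and the goal is to fall back to an open, accept-everything iptables policy (the exact steps taken on the beta hosts may have differed):

```
# Run as root on an affected instance.
service ferm stop                    # stop the firewall service if it is running
apt-get purge -y ferm                # remove the package together with its config
rm -rf /etc/ferm /var/cache/ferm     # clear any leftover ferm state
iptables -P INPUT ACCEPT             # reset default policies to accept
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F                          # flush whatever rules ferm left behind
iptables -X                          # delete now-empty custom chains
```

The salt one-liner in the next comment runs essentially the same steps on every beta host at once.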
[12:01:23] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981589 (10akosiaris) Just ran sudo salt '*' cmd.run 'service ferm stop; apt-get purge ferm -y; rm -rf /var/cache/ferm /etc/ferm' so all beta hosts shou...
[12:03:42] 3Labs-Team: New disk partition scheme for labs instances - https://phabricator.wikimedia.org/T87003#981590 (10yuvipanda) p:5Triage>3Normal
[12:23:35] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[12:25:47] 3Labs-Team: Set up ssh checks for all labs hosts - https://phabricator.wikimedia.org/T86027#981623 (10yuvipanda)
[12:25:50] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981620 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Yay, that is fixed now :)
[12:28:41] 3Labs-Team: Set up ssh checks for all labs hosts - https://phabricator.wikimedia.org/T86027#981627 (10yuvipanda) 5Open>3Resolved a:3yuvipanda
[12:45:18] 3Wikimedia-Labs-Infrastructure: "Stale file handle" for /public/dumps/ - https://phabricator.wikimedia.org/T87013#981655 (10Nemo_bis) 3NEW
[13:19:30] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[13:20:37] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981728 (10akosiaris)
[13:31:30] 3Tool-Labs: Open Grid Engine Job dumps core (node) - https://phabricator.wikimedia.org/T86905#981771 (10edsu) p:5Triage>3Volunteer?
[13:43:09] 3Wikimedia-Labs-Infrastructure, Labs-Team: port 22 blocked in some cases despite being allowed with security groups - https://phabricator.wikimedia.org/T86143#981796 (10hashar) Thanks for the cleanup and investigation!
[14:04:33] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[14:10:16] 3Tool-Labs: Wikimedia-hosted OpenStreetMap (OSM) - bw_mapnik tiles issue - https://phabricator.wikimedia.org/T86932#981840 (10Aklapper) http://tools.wmflabs.org/osm/ is a 404, where is that hosted?
[15:15:33] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[15:45:30] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:16:28] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[16:23:20] paravoid: have any other suggestions for how to get us booting with lvm? I've tried both of your suggestions, no dice :(
[16:23:50] something else is going on then
[16:24:10] "Timed out waiting for device dev-mapper-vd\x2dlog.device"
[16:29:48] 3Tool-Labs: Provide page view metrics for individual tools on toollabs - https://phabricator.wikimedia.org/T87001#982009 (10Ironholds) "non-bot"?
[16:29:58] 3Tool-Labs: Provide page view metrics for individual tools on toollabs - https://phabricator.wikimedia.org/T87001#982010 (10Ironholds) p:5Triage>3Volunteer?
[16:40:45] andrewbogott: \x2d? that's weird, considering the -'s before that are just dashes
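The \x2d is systemd's unit-name escaping rather than anything odd in the volume name: when a device path is turned into a .device unit, the "/" separators become dashes and literal dashes are escaped as \x2d, so /dev/mapper/vd-log is waited on as dev-mapper-vd\x2dlog.device. A quick way to confirm the expected unit name (a sketch, assuming systemd-escape from the systemd package is available on the build host):

```
# Show the .device unit name systemd derives from a device path:
# path separators become "-", literal dashes are escaped as \x2d.
systemd-escape --path --suffix=device /dev/mapper/vd-log
# prints: dev-mapper-vd\x2dlog.device
```

The escaping itself is harmless; the timeout means the logical volume never appeared during boot, which fits the later observation that images without LVM boot fine.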
[16:46:31] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[16:57:33] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[16:59:23] i think we should mention it somewhere in the tools help that database connections are not persistent (anymore) and you should use them only when you need them (?)
[17:04:32] andrewbogott: regardless of what we end up doing, could you prepare an image with no separate /var or /var/log, to see if this is still causing trouble?
[17:05:03] paravoid: It's definitely the case that without lvm everything works fine.
[17:05:12] ok
[17:05:30] that's what the box that I'm using to build the new images is.
[17:05:43] how can I test/debug this myself?
[17:05:59] oh I just saw your whirly-fever comment
[17:06:11] no worries, we can do this next week
[17:06:13] in person even
[17:06:17] go rest :)
[17:06:36] paravoid: This is how to build a new image: https://wikitech.wikimedia.org/wiki/OpenStack#Building_a_Debian_image
[17:06:51] And you can probably test by launching locally w/curses; no need to actually install them in glance or anything.
[17:07:04] build/test is pretty painless, only takes 10-15 minutes to build a new image.
[17:08:12] paravoid: the source files for the image are in /etc/bootstrapvz. They're puppetized, so disable puppet before tinkering.
[17:08:41] do you have the image that fails there already?
[17:08:50] (And, thanks for your concern :) It's not really that I'm too sick to look at a screen, just too sick to think my way through any problem with stack depth > 1 )
[17:08:51] debian-jessie-amd64-150115.qcow2
[17:08:58] yes, in /target
[17:09:04] right
[17:09:27] There's also a boot log pasted to that bug from a slightly different image. And that image is installed in glance as 'debian-8.0-jessie (testing)'
[17:09:33] so many test surfaces :)
[17:10:13] thanks for looking
[17:10:38] where do you boot those images?
[17:10:49] the "locally" part isn't that VM, is it?
[17:10:56] do we support nested kvm?
[17:12:00] Hm… I think it works to boot on that vm, yeah. If not, there's a copy in /tmp on virt1000
[17:13:29] well, qemu-system-x86_64 is clearly not installed on that VM. I must've been working on a different build box when I documented that.
[17:13:47] but, I'm sure that nested kvm /does/ work if the right things are installed.
[17:18:25] no, nested kvm doesn't work there
[17:20:50] virt1000 has no kvm installed either :)
[17:22:30] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[17:22:37] well, what the heck, I'm sure I've done that on a labs box.
[17:22:44] Maybe it was on an ubuntu instance
[17:34:30] andrewbogott: so I booted the image that's in virt1000's /tmp
[17:34:35] oh you left
[18:14:05] 3Tool-Labs: WIWOSM not working in Wikipedias - https://phabricator.wikimedia.org/T87038#982314 (10Aklapper) Hi, can you please provide a link that shows the problem? Wondering if https://lists.wikimedia.org/pipermail/labs-l/2015-January/003248.html and https://lists.wikimedia.org/pipermail/labs-l/2015-January/00...
[18:24:09] 3Tool-Labs: WIWOSM not working in Wikipedias - https://phabricator.wikimedia.org/T87038#982351 (10pere_prlpz) I think the problem is unrelated to maintenance issues because links from categories stopped working at least weeks ago and maps in articles at least some days ago. Links: - For WIWOSM maps in arti...
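Picking up the "launch locally w/curses" suggestion from the image-build discussion above, a minimal smoke test of the built image with QEMU might look like the sketch below. The image path is the one mentioned in the conversation; -enable-kvm assumes the host exposes /dev/kvm, which per the discussion is not the case inside a labs VM without nested KVM.

```
# Boot the freshly built image with a text-mode console (-curses), so the boot
# messages are visible over ssh. -snapshot keeps the qcow2 file untouched.
# Drop -enable-kvm on hosts without /dev/kvm and accept much slower emulation.
qemu-system-x86_64 \
  -enable-kvm \
  -m 1024 \
  -snapshot \
  -drive file=/target/debian-jessie-amd64-150115.qcow2,format=qcow2 \
  -curses
```

Watching that console should show directly whether the boot stalls waiting for dev-mapper-vd\x2dlog.device.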
[18:32:54] 3Tool-Labs: WIWOSM not working in Wikipedias - https://phabricator.wikimedia.org/T87038#982390 (10pere_prlpz) And [[T86932]] seems to be unrelated to these issues, although I don't know how OSM-related services are organized in MediaWiki and the labs, nor if all of them rely on a common broken resource that might b...
[18:45:24] 3Tool-Labs: WIWOSM not working in Wikipedias - https://phabricator.wikimedia.org/T87038#982487 (10Vriullop) WIWOSM is explained at https://de.wikipedia.org/wiki/Wikipedia:WikiProjekt_Georeferenzierung/Anwendungen/OpenStreetMap/en, loaded as a gadget in some projects or by https://de.wikipedia.org/wiki/MediaWiki:...
[18:48:06] 3Tool-Labs: Wikimedia-hosted OpenStreetMap (OSM) - bw_mapnik tiles issue - https://phabricator.wikimedia.org/T86932#982538 (10xkomczax) The URL of a tile is, for example, http://a.tiles.wmflabs.org/bw-mapnik/12/2239/1397.png . Over here you can see a pink raceway next to the Lipová village and a few green areas (= prote...
[19:18:34] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[19:43:32] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:39:35] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[20:39:38] how come there's no private ip here: https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000006d8.eqiad.wmflabs
[20:39:38] should that be 10.68.16.120?
[20:39:43] (from -operations)
[20:40:58] I saw that it's marked as building there still
[20:41:00] and yet: 64 bytes from deployment-parsoid05.eqiad.wmflabs (10.68.16.120): icmp_req=1 ttl=64 time=0.414 ms
[21:24:33] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[21:35:31] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[22:25:32] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:36:33] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:06:33] RECOVERY - Puppet failure on tools-exec-14 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:17:30] PROBLEM - Puppet failure on tools-exec-14 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:44:14] 3Wikimedia-Labs-wikistats: Fix all the Wikia stats - https://phabricator.wikimedia.org/T61943#983322 (10Nemo_bis) \o/
[23:58:20] !log wikimania-support Added Dduvall as project member
[23:58:27] marxarelli: ^
[23:58:33] bd808: gracias!
[23:58:52] The instance is bd808-vagrant
[23:59:12] and the files are in /srv/vagrant/support/... something
[23:59:52] rad
[23:59:57] looks like the !log bot is awol