[00:00:14] RECOVERY - Puppet failure on tools-webgrid-generic-02 is OK: OK: Less than 1.00% above the threshold [0.0] [00:00:28] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [00:01:50] PROBLEM - Puppet failure on tools-exec-20 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [00:03:36] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:07:48] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [00:08:54] RECOVERY - Puppet failure on tools-webgrid-03 is OK: OK: Less than 1.00% above the threshold [0.0] [00:11:41] RECOVERY - Puppet failure on tools-webgrid-07 is OK: OK: Less than 1.00% above the threshold [0.0] [00:13:13] RECOVERY - Puppet failure on tools-webgrid-08 is OK: OK: Less than 1.00% above the threshold [0.0] [00:23:20] PROBLEM - Puppet failure on tools-exec-1401 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [00:31:50] RECOVERY - Puppet failure on tools-exec-20 is OK: OK: Less than 1.00% above the threshold [0.0] [00:47:00] OK, why is my tool failing to start? [00:48:17] Magog_the_Ogre: hi. which one is this? [00:50:55] nevermind, it started this time [00:50:58] it said Timeout: could not stop job in 30s [00:51:03] so I thought it was a startup error [00:51:34] ah, ok :) [01:47:41] wikibugs stopped responding again [02:10:19] Hi folks. I'm a new developer for a labs project, and I am wondering about storage. My VM includes 160GB of storage, but it's unclear where this is mapped in the filesystem. Does anybody know? [02:11:36] hi Shilad [02:11:52] hi! [02:12:08] Shilad: so by default it is unallocated, and you can allocate it by enabling the role role::labs::lvm::srv [02:12:29] Shilad: you can do so by going to wikitech.wikimedia.org/wiki/Special:NovaInstance, selecting your instance, clicking ‘configure’, and ticking the checkbox next to that one [02:12:44] Shilad: and then you run puppet (or wait 20 minutes) and you will have a /srv partition with the rest of your stuff [02:12:52] Awesome. Thanks. [02:12:54] Shilad: alternatively, you can just use lvm to create your partitions as you see fit [02:13:03] Shilad: you can run puppet with ‘sudo puppet agent -tv' [02:13:43] Do you know anything about it? Is it a reasonably-close-to-local SSD? Or is it NFS mounted? [02:14:09] Shilad: it’s basically a local spinning disk [02:14:23] Shilad: NFS is mounted in /data/project (and your home directory is NFS). use appropriately :) [02:14:38] YuviPanda: Great. Thanks! [02:14:48] Shilad: :) [02:26:51] !log tools created tools-exec-12{01-10} [02:27:20] Logged the message, Master [02:29:32] YuviPanda: what is the storage specification on m1 type instances? ("0 GB storage") [02:29:37] (03CR) 10Krinkle: [C: 031] Move Math repo from #mediawiki-visualeditor to #wikimedia-editing [labs/tools/grrrit] - 10https://gerrit.wikimedia.org/r/206041 (owner: 10Jforrester) [02:30:41] Negative24: what do you mean by ‘storage specification’? [02:31:06] Negative24: oh, that ‘0GB storage’. ignore that... [02:31:13] NFS? 
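(A rough sketch of what Shilad's next steps on the instance could look like, following the recipe above: tick role::labs::lvm::srv in the wikitech configure panel, run puppet, and check the mounts. Only `sudo puppet agent -tv` and the /srv mount point come from the conversation itself; the device and volume-group names in the manual-LVM branch are made up for illustration.)

```bash
# Apply the newly ticked role right away instead of waiting ~20 minutes:
sudo puppet agent -tv

# Afterwards the leftover instance storage should show up as /srv,
# and NFS vs. local disk can be told apart like this:
df -h /srv                 # local spinning disk, per the discussion above
findmnt -t nfs,nfs4        # should list /home and /data/project (NFS)

# The do-it-yourself alternative YuviPanda mentions -- plain LVM.
# Device and names below are hypothetical:
sudo pvcreate /dev/vda4
sudo vgcreate instance-vg /dev/vda4
sudo lvcreate -l 100%FREE -n srv instance-vg
sudo mkfs.ext4 /dev/instance-vg/srv
sudo mount /dev/instance-vg/srv /srv
```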
[02:31:14] Negative24: look at root partition size, that’s how much you’re getting for realz [02:31:24] Negative24: so your root partition is always local spinning disks [02:31:31] and then your /data/project and /home is NFS [02:31:38] figured as much [02:31:57] just wondering if storage is "more" NFS [02:32:58] nope [02:33:07] I guess it used to refer to Ceph [02:33:07] but we no longer use ceph [02:33:19] ah [02:35:21] !log tools set tools-exec-12{01-10} to configure as exec nodes [02:35:26] Logged the message, Master [02:36:21] gets me every time: phabricator's git auth uses a different password. the "VCS password" [02:37:51] * Negative24 wonders why Phab has to make things complicated. "For security reasons..." they say... [02:38:44] but at least phab-02 works [02:54:55] PROBLEM - Puppet failure on tools-exec-1204 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:00:52] PROBLEM - Puppet failure on tools-exec-1205 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:01:52] PROBLEM - Puppet failure on tools-exec-1207 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:13:40] PROBLEM - Puppet failure on tools-exec-1206 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:16:17] PROBLEM - Puppet failure on tools-exec-1210 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [03:18:59] PROBLEM - Puppet failure on tools-exec-1209 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [04:03:02] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/TheMesquito was created, changed by TheMesquito link https://wikitech.wikimedia.org/wiki/Nova+Resource%3aTools%2fAccess+Request%2fTheMesquito edit summary: Created page with "{{Tools Access Request |Justification=Bot for general wikipeida-en stuff to join my channel on freenode |Completed=false |User Name=TheMesquito }}" [04:31:41] Welcome [04:31:57] Hello! [05:49:49] PROBLEM - Puppet failure on tools-mailrelay-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [06:44:36] PROBLEM - Puppet failure on tools-submit is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [06:47:18] Anyone around to poke wikibugs ? [07:09:36] RECOVERY - Puppet failure on tools-submit is OK: OK: Less than 1.00% above the threshold [0.0] [07:49:50] RECOVERY - Puppet failure on tools-mailrelay-01 is OK: OK: Less than 1.00% above the threshold [0.0] [08:38:14] PROBLEM - Puppet staleness on tools-shadow is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [09:11:01] andrewbogott_afk, YuviPanda: is something wrong with the new hardware? I just spawned debian instance "huggle-d2" and it's REALLY very slow. Just launching aptitude took almost 10 minutes [09:11:29] and it's obviously not IO but CPU which is slowing it down [09:11:41] it has 1 core only, but still, it shouldn't be so slow [09:11:58] what CPUs do these boxes have, e5? [09:12:32] oh, it's e3...
same thing I use in my own servers :o [10:20:31] PROBLEM - Puppet staleness on tools-exec-catscan is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [10:21:45] PROBLEM - Puppet staleness on tools-exec-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [10:24:15] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [43200.0] [10:26:13] PROBLEM - Puppet staleness on tools-exec-11 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [10:26:27] PROBLEM - Puppet staleness on tools-exec-05 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [10:26:43] PROBLEM - Puppet staleness on tools-exec-04 is CRITICAL: CRITICAL: 62.50% of data above the critical threshold [43200.0] [10:27:24] PROBLEM - Puppet staleness on tools-exec-09 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:28:00] PROBLEM - Puppet staleness on tools-exec-24 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0] [10:28:30] PROBLEM - Puppet staleness on tools-exec-22 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [10:29:16] PROBLEM - Puppet staleness on tools-exec-06 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [10:29:24] PROBLEM - Puppet staleness on tools-exec-23 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:31:52] PROBLEM - Puppet staleness on tools-exec-15 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0] [10:32:16] PROBLEM - Puppet staleness on tools-exec-gift is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [43200.0] [10:33:12] PROBLEM - Puppet staleness on tools-exec-wmt is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [10:33:34] PROBLEM - Puppet staleness on tools-exec-12 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [43200.0] [10:34:34] PROBLEM - Puppet staleness on tools-exec-21 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:34:35] PROBLEM - Puppet staleness on tools-exec-13 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [43200.0] [10:36:07] PROBLEM - Puppet staleness on tools-exec-08 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [43200.0] [10:36:50] PROBLEM - Puppet staleness on tools-exec-14 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [43200.0] [10:37:04] PROBLEM - Puppet staleness on tools-exec-07 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [43200.0] [10:37:53] PROBLEM - Puppet staleness on tools-exec-10 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [43200.0] [10:40:25] PROBLEM - Puppet staleness on tools-exec-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [11:36:50] PROBLEM - Puppet failure on tools-static-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [11:47:00] Coren_away, YuviPanda, andrewbogott_afk: someone here? The CPU problems on huggle-d2 still happen [11:47:15] it's insanely slow, execution of few instructions take minutes [12:22:37] petan: can you check which virt node it runs on ? 
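(The lookup petan is being asked to do normally goes through the instance's wikitech page, since the @labs-info bot no longer answers. An admin with OpenStack credentials could get the same answer directly from the nova CLI; a sketch, assuming admin access:)

```bash
# Show which virt host a given instance is scheduled on (admin-only field):
nova show huggle-d2 | grep 'OS-EXT-SRV-ATTR:host'
```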
[12:22:48] hmhmhm [12:22:50] I have an instance on beta that has slow downs, it is on virt1007 [12:22:55] @labs-info huggle-d2 [12:22:55] I don't know this instance, sorry, try browsing the list by hand, but I can guarantee there is no such instance matching this name, host or Nova ID unless it was created less than 38 seconds ago [12:23:03] no longer works :/ [12:23:12] hashar: where can I find it [12:23:14] even had the kernel report "BUG: soft lockup - CPU#1 stuck for 23s!" [12:23:20] petan: wikitech! [12:23:29] just search for your instance name [12:23:41] [ 5153.930319] hrtimer: interrupt took 11107333 ns [12:23:45] I have this [12:23:47] the search result should have a page to something like https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000000xxxx.eqiad.wmflabs [12:23:51] looks to me like kernel issue as well [12:24:22] labvirt1005 [12:24:47] hey [12:25:08] the cpu is really totally exhausted, cpuinfo says it's "2.4GHZ" but even running a few instructions there takes forever [12:25:14] filing a bug [12:25:38] PROBLEM - Puppet staleness on tools-exec-20 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [43200.0] [12:26:48] petan: can you check dmesg ? [12:26:52] dmesg -T [12:30:21] petan: I have filed https://phabricator.wikimedia.org/T97520 - you are on CC - [12:30:31] thanks [12:30:42] petan: please look at dmesg -T [12:30:43] might help [12:31:03] yes I did, last message is [12:31:05] [Wed Apr 29 10:05:43 2015] hrtimer: interrupt took 11107333 ns [12:31:11] nothing else there [12:31:24] everything before this looks like normal boot messages [12:31:50] well [12:31:51] [Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain package detection failed [12:31:52] [Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain core detection failed [12:31:53] [Wed Apr 29 08:40:45 2015] intel_rapl: RAPL domain uncore detection failed [12:31:54] this too [12:32:01] 1005 is suspected of having issues; that is why Andrew has been avoiding it in migrations. Apparently, the suspicions were warranted. [12:32:39] well, it's not like it wouldn't work, it's just extremely slow, it would be cool if someone could ssh to bare metal and check the current CPU utilization of KVM [12:32:48] maybe there are just too many instances running on it? [12:33:00] does it have VT-x and VT-d enabled on CPU? [12:34:11] petan: Do try to shut down and restart the instance - we think some of the migrations towards 1005 left the vms in an odd state. [12:34:16] if the CPU utilization is indeed huge (too many busy instances), then it makes sense for it to run slow [12:34:21] ok [12:34:58] I issued "reboot" let's see if it's gonna do something :P it just locked up for now [12:35:09] Coren_away: would virt1007 be affected as well? [12:35:15] will try to reboot it [12:35:35] the task is https://phabricator.wikimedia.org/T97520 :] [12:36:08] hashar: We noticed nothing of the sort, but I suppose it's possible live migration just seemed to work right. :-( I'll notify andrew as soon as he wakes. [12:36:22] I am going to restart the instance [12:36:26] might fix it [12:38:38] I am just wondering how many instances are on virt5 now? [12:38:50] 1005 [12:39:24] petan: Few. That's not the issue. [12:40:52] hashar: is your instance ubuntu or debian? [12:41:04] ubuntu-14.04-trusty (deprecated 2015-02-03) [12:42:12] ok it's back up [12:43:15] cpu is still kind of fucked up [12:43:35] Stepping: 1 :P no wonder [12:43:46] it's the first version intel made, now they are fixing the bugs... [12:44:09] * Coren_away chuckles.
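(The triage hashar walks petan through above, condensed into one command; the grep patterns are simply the symptoms quoted in the conversation, not a canonical list.)

```bash
# Human-readable timestamps, filtered for the kernel messages seen above:
dmesg -T | grep -iE 'soft lockup|hrtimer|intel_rapl|EDAC'
```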
[12:44:19] They be virtual cpus. They're all model 42 stepping 1. :-) [12:44:58] Aha. Looks like a hardware issue on virt1005 [12:45:00] idk, lscpu usually shows the real cpu even on virtual machines, at least for all my machines where I run some virtual machines [12:45:06] [422616.696930] EDAC MC0: 13926 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#2 (channel:1 slot:2 page:0x1c9faa4 offset:0xfc0 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:0 channel_mask:2 rank:8) [12:45:31] hmm... either memory or controller :/ [12:45:32] Actual cpu is model 62 stepping 4. [12:45:39] aha [12:49:33] We'll have to evacuate and shut that one down asap. [12:50:45] hashar: virt1007 is old hardware and old OS; but you can't currently trust wikitech for "where are instances" because while migration is going on that is only irregularly updated. [12:51:07] ohhhh [12:52:03] | b26e5c79-7190-431c-9fc9-e12bf05c0cd6 | deployment-parsoid05 | labvirt1005 | ACTIVE | [12:52:34] To no-one's actual surprise. [12:53:12] As soon as Andrew wakes we'll make a plan to evacuate the failing hardware asap [12:53:32] Said plan may include yelling at the manufacturer a lot. [12:53:42] This is brand new hardware. [12:59:13] ahh [13:00:32] maybe that is just that instance having issues? [13:01:29] ah I have just seen your subtask [13:02:09] andrewbogott_afk: Ping! Wakeupworthy I believe. [13:02:27] do you really think this wakes him up? [13:02:36] o.o [13:02:51] petan: never underestimate our ops! [13:03:15] well I mean, pinging someone on irc? :D he had to be sleeping with laptop under pillow [13:03:37] not that I wouldn't :P [13:12:01] ok... [13:12:16] The backscroll is full of stuff from yuvi rebuilding things. Coren, can you summarize? [13:12:40] andrewbogott: User reports of instances being crappy led me to [13:12:41] * hashar prepares coffee and donuts [13:12:54] https://phabricator.wikimedia.org/T97521 [13:13:08] Once I realized they were all on the same host [13:13:41] some instances started having very slow/stuck CPUs; petan and I have two examples that each run on labvirt1005. [13:14:29] So, that means that a single dimm is unreadable? [13:15:17] That is — do you judge that to be a hardware issue limited to a particular chunk of memory, or is it something systematic like the bus introducing errors? [13:15:34] andrewbogott: It's not immediately clear whether that's the dimm or the south bridge having issues; lemme grep the logs to see. [13:16:38] And, one of the symptoms is cpu spiking? (Because that may be happening on 1003 as well…) [13:16:41] * andrewbogott checks logs on 1003 [13:17:11] andrewbogott: I see only errors for Channel 1, but more than one dimm reported. [13:17:50] andrewbogott: So "unclear". One dimm can cause issues on the channel - that wouldn't be the first time I see this - or there is an electrical issue with the bank. [13:17:57] Or the south bridge is busted. [13:18:03] for what it is worth, I restarted deployment-parsoid05 from the wikitech console and it does not come back. Error: Failed to terminate process 5990 with SIGKILL: Device or resource busy [13:18:04] I can't really say that CPU is spiking, it's just very slow in processing stuff, eg. instance is running VERY slow [13:18:04] I think that's "service call" time [13:18:38] my instance rebooted fine but is still very slow, try yourself it's huggle-d2 [13:18:53] just starting aptitude takes 2 minutes for it to load [13:19:01] Coren: I’m no good at reading dmesg, are there timestamps?
Can you tell how long this has been happening? [13:20:18] If we reboot will it detect the failure during POST and sequester the bad memory? Or will it detect the failure and refuse to boot? [13:20:20] andrewbogott: syslog is full of it going back to Apr 23 04:40:45 [13:20:50] btw Coren if wikipedia is not mistaken, the CPU's host bridge to handle RAM is called the "northbridge"; the southbridge is for peripherals, PCI cards and so on [13:20:52] andrewbogott: I don't know - depends on how the BIOS works and how thorough its check is. [13:21:11] petan: Erm, yes. My compass is backwards this morning. :-) [13:21:48] petan: ok if I stop your instance? I want to see if 1005 is healthy enough to support an evacuation [13:22:03] it's even ok to nuke it, I just created it, totally empty [13:22:15] great, a perfect test subject [13:28:25] petan: ok, how does it look? [13:34:56] looks damn fast [13:35:37] great. [13:35:43] OK… next up, hashar, did you have a troubled instance? [13:36:19] yup deployment-parsoid05 [13:36:31] I tried to restart it via wikitech but it does not come back [13:36:56] hashar: If I accidentally murder it, would that be ok? [13:36:59] it is error status and horizon reports Failed to terminate process 5990 with SIGKILL: Device or resource busy [13:37:08] it runs the parsoid service [13:37:12] not sure how long it takes to rebuild it [13:45:00] Coren: I’m thinking that I’ll shrink these instances as part of the migration script. But I’m /also/ thinking that I’ll try suspending rather than stopping them… [13:45:07] Do you think that shrinking a suspended instance is bad news? [13:46:19] brb [13:46:28] andrewbogott: It *probably* isn't - by definition, the zeroed out blocks will remain zeroed blocks and - afaik - suspended instances do not keep the image file open [13:46:37] andrewbogott: But I'd test on a non-precious instance first. [13:46:41] * andrewbogott tries it [13:47:59] Second thing I'd worry about is the possibility of corruption of instances. It's not very likely, but it's possible. [13:50:07] Test seems to’ve worked fine [13:56:36] Goes to grab caffeine and food [14:17:01] oh for god's sake [14:17:16] integration-saltmaster | labvirt1005 [14:17:16] :( [14:17:47] You know, I thought virt1005 was already depooled because you had already considered it suspect. [14:17:58] * Coren must be remembering wrong. [14:20:54] RECOVERY - Host deployment-parsoid05 is UP [14:21:02] there is some hope! [14:27:49] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/TheMesquito was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=156413 edit summary: [14:33:25] Coren: I removed it from the migration script but didn’t depool it from the scheduler. [14:34:18] Ah. [14:34:53] although obviously that would’ve been smarter [14:37:35] Coren: can we drain tools-exec-04 and then just delete it? [14:40:34] * Coren checks room on the queues. [14:40:55] Yeah, should be okay - especially as Yuvi is making a new batch. [14:41:21] !log tools disabled -exec-04 (going away) [14:41:29] Logged the message, Master [14:42:56] It's being difficult. Jobs are slow to die. [14:43:07] There we go. [14:44:13] andrewbogott: can you confirm you are migrating instances out of virt1005 ? [14:44:18] !log tools -exec-04 drained; removed from queues. Rest well, old friend. [14:44:23] Logged the message, Master [14:44:25] deployment-parsoid05 went back up and is working just fine now as far as I can tell [14:44:34] andrewbogott: -exec-04 is now okay for reaping.
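(The drain Coren just performed on tools-exec-04, sketched as the standard gridengine commands; the exact queue layout and host FQDN on tools are assumptions here, not taken from the log.)

```bash
# Disable every queue instance on the node so no new jobs land there:
qmod -d '*@tools-exec-04'
# Wait until nothing is left running on it:
qstat -f | grep tools-exec-04
# Then drop it from the grid before deleting the VM:
qconf -de tools-exec-04.eqiad.wmflabs
```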
[14:45:30] andrewbogott: Dunno if it's because of all the added activity from the evacuation, but the rate of RAM errors seems to be increasing. [14:47:14] !log tools deleting tools-exec-04 [14:47:18] Logged the message, dummy [14:48:55] Coren: so, I’m using two different migration scripts, one which suspends/copies and one which stops/copies/shrinks [14:49:13] Because I’m too nervous to do suspend/copy/shrink [14:49:28] So, the other exec nodes will be suspended and copied, when I get to them. They don’t need shrinking anyway since they’re new [14:49:43] wmf [14:49:52] Of course the copying takes ages [14:50:09] Sudden, obvious application for 10g ethernet [14:50:18] *wfm [14:50:43] WikiFedia Moundation [14:50:47] PROBLEM - Host tools-exec-04 is DOWN: CRITICAL - Host Unreachable (10.68.16.33) [14:51:35] and wikitech updates its host/virt information just fine apparently https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:I-000006d8.eqiad.wmflabs&diff=156398&oldid=156380 [14:51:39] ^O^ [14:52:08] hashar: It updates on an instance state change. [14:52:15] So when I do live migration, that’s invisible and it doesn’t update [14:52:25] but a cold-migration that involves a restart, that should cause an update. [14:52:50] I need to hack in a big red button that forces an update everywhere [14:53:30] I’ve already implemented daily polling updates but the subsystem that it requires seems to be broken in icehouse :( [14:55:08] hashar: I just moved deployment-test. I’m going to do the other deployment instances now if that’s OK [14:55:20] andrewbogott: yeah sure go! :] [14:55:30] ok, next up is deployment-cache-bits01 [14:55:39] thank you for taking care of all that mess on wake up [14:56:00] meanwhile, I am wondering whether you guys run a memory check when receiving new hardware ( memtest86 ? ) [14:56:41] I don’t. These boxes take minutes to start up, I figured that was because they were running a memory check on each startup [14:56:43] I don't think that's part of SOP but you'd want to ask our DC meisters. [14:57:34] I am not sure how much it is worth it [14:57:48] but potentially a live cd that boots into memtest86 would be a good check once a machine is rackd [14:57:49] racked [14:57:57] (live usb key) [15:54:22] Coren: does labs zero rev_text_id ? [15:55:00] I believe it does, iirc [15:56:06] grr, then that's not the root cause [15:56:29] root cause of? [15:56:50] RECOVERY - Puppet failure on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [15:57:20] Coren: https://phabricator.wikimedia.org/T97532 [15:59:27] fwiw it's zeroed out in labs because it's worthless (no access to the text store) and can reveal that a suppressed edit has taken place in some cases where it shouldn't. [15:59:53] But if you need queries run in prod to validate something in re this bug you can bug one of us. [16:01:00] Coren: wasn't aware of the zeroing and thought that might be the cause of the bug. but it's a false positive [16:01:00] Betacommand: I posted the root cause on the other bug [16:03:56] legoktm: which other bug? [16:04:09] Betacommand: the tracking one, https://phabricator.wikimedia.org/T97536 [16:05:24] legoktm: using the oldid= url hack they should still be visible though, shouldn't they? [16:06:20] valhallasw`cloud: think wikibugs died [16:06:21] https://en.wikipedia.org/?oldid=659870568 apparently not [16:06:59] legoktm: correct, which is what has me scratching my head [16:41:26] Betacommand: let me give it a nudge...
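(What the stop/copy/shrink variant of the migration script plausibly boils down to on the virt host. This is a sketch, not the actual script: the domain name reuses the instance ID from the wikitech URL above and the image paths are the stock nova layout, both of which are assumptions.)

```bash
# Pause the guest so the image stops changing underneath us:
virsh suspend i-000006d8
# Rewriting the qcow2 image drops blocks that are all zeroes, which is
# why zeroed-out space "stays zeroed" after the shrink, per Coren above:
qemu-img convert -O qcow2 \
    /var/lib/nova/instances/i-000006d8/disk \
    /var/lib/nova/instances/i-000006d8/disk.shrunk
mv /var/lib/nova/instances/i-000006d8/disk.shrunk \
   /var/lib/nova/instances/i-000006d8/disk
virsh resume i-000006d8
```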
[16:43:41] !log tools.wikibugs valhallasw: Deployed 1d785dc3ad22a434749f8ec0d466180f3de9ea52 channels: Continuous-Integration is now Continuous-Integration-Infrastructure wb2-phab, wb2-irc [16:43:46] Logged the message, Master [16:44:43] 10Wikibugs: wikibugs test bug - https://phabricator.wikimedia.org/T1152#1245986 (10valhallasw) dum [16:44:46] Betacommand: ^ [16:44:53] thanks for mentioning it [16:45:24] valhallasw`cloud: np [16:45:37] thanks for fixing [16:57:49] PROBLEM - Puppet failure on tools-static-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [17:19:01] I have a question about i18n, what's the best approach nowadays? Set up separate databases per language, or use an Extension, or both? [17:19:14] bd808? ^ [17:27:19] renoirb: i18n for the wiki content? [17:27:28] bd808?, yes [17:27:36] As in content translation [17:28:58] Hmmm... I know we generally use separate wikis but there are places like meta and commons that have mixed languages. You might ask the nice folks in #mediawiki-i18n for best practices. [17:57:25] YuviPanda: let me know when you’re up and working? [18:03:16] OMG I found a solution to the LDAP thing. It is *so* ugly. :-) [18:05:22] Turns out the solution to keeping the cake and eating it too is to have a fake cake. :-) [18:06:57] what was the problem? [18:08:19] Platonides: Not directly related to labs; it's tech debt around an evil hack on the labstores that prevents us from using the same user settings as other places in prod. The short of it: NFS needs to have the LDAP groups to work right (because of the 8-group limit) but that conflicts with our normal puppet admin user distribution. [18:08:32] Which is bad. [18:10:28] recompiling the kernel with a bigger group limit? [18:10:35] when something like the RAM hardware failure happens like on virt1005, i guess that doesn't really have an actionable, right? it's more like "shit happens" [18:10:41] That's not a kernel issue - that's a protocol issue. [18:11:02] mutante: "Yell at the vendor a lot" is an actionable. This is new hardware. :-) [18:11:05] the only thing i could think of would be "switch hardware vendor" but [18:11:13] that seems like "could happen with anyone" [18:11:22] Coren: heh, yea [18:11:40] well, i guess it all depends if it happens again [18:13:03] "[NFS] AUTH_SYS started off with 8, then went to 12, and finally settled on 16 supplemental group identifiers" [18:13:56] there's a fixed limit even in rfc1831 :( [18:15:46] I know. The solution is to let the server handle perms. That works fine, so long as the server actually knows what the group memberships are. [18:16:36] Turns out I was able to convince nslcd to provide *numeric* usernames for ldap entries so that they do not conflict with /etc/passwd entries. [18:16:41] Fake cake. :-) [18:17:28] And since the usernames are, in fact, the user IDs, that works even if the tool is trying to be smart and recognize when you're giving a uid and not a username. [18:17:49] (Because both are the same) [18:24:48] hehe [18:31:27] 10Tool-Labs: Get rid of jlocal - https://phabricator.wikimedia.org/T95796#1246376 (10Ricordisamoa) [19:04:34] andrewbogott: am up now [19:04:36] * YuviPanda reads backlog [19:06:52] YuviPanda: I have a script moving most instances off of labvirt1005. I haven’t touched tools instances, they are: https://dpaste.de/NOM8 [19:07:07] Question is — those exec nodes aren’t pooled, are they? So I can just delete ‘em?
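(The protocol limit Coren and Platonides are discussing above is easy to see from a client: AUTH_SYS carries at most 16 supplementary group IDs per request, so membership past that is silently dropped on NFS unless the server resolves groups itself. A quick check:)

```bash
# Count this user's supplementary groups; anything past 16 never
# makes it into an AUTH_SYS NFS request:
id -G | wc -w
```

Having the server do the lookup (for example rpc.mountd's --manage-gids option) is the usual workaround; the log does not say which exact mechanism the labstores ended up with beyond the nslcd trick described above.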
[19:07:14] andrewbogott: you can, yes [19:07:30] andrewbogott: let me switch over static-02 [19:07:44] @seen hashar [19:07:44] mutante: Last time I saw hashar they were quitting the network with reason: Remote host closed the connection N/A at 4/29/2015 3:48:44 PM (3h18m59s ago) [19:07:48] andrewbogott: mailrelay isn’t being used atm but valhallasw`cloud did some work on it yesterday - so would be nice to keep it alive [19:07:52] YuviPanda: shall I just delete the exec nodes now? [19:08:01] YuviPanda: ok, great, I will cold migrate it. [19:08:10] andrewbogott: yeah you can. [19:08:21] eh, feel free to kill mailrelay; it's in a broken state and I haven't done anything other than the initial puppet run [19:08:29] valhallasw`cloud: sure? [19:08:33] yes [19:08:40] ok, this is easy! [19:08:48] * andrewbogott deletes everything [19:09:05] I'm thinking of actually rebuilding it on precise for now, so we can depool -mail earlier [19:09:34] but YuviPanda probably disagrees with that ;-) [19:11:17] I'm not going to work on it today anyway, today it's LaTeX cursing day instead of puppet cursing day [19:11:50] PROBLEM - Host tools-exec-1201 is DOWN: CRITICAL - Host Unreachable (10.68.16.133) [19:11:51] !log tools failed over tools-static to tools-static-01 [19:11:57] Logged the message, Master [19:12:10] PROBLEM - Host tools-exec-1208 is DOWN: CRITICAL - Host Unreachable (10.68.17.119) [19:12:24] andrewbogott: you can cold migrate or kill tools-static-02. it’s trivially recreatable [19:12:34] PROBLEM - Host tools-exec-1401 is DOWN: CRITICAL - Host Unreachable (10.68.16.151) [19:12:38] PROBLEM - Host tools-exec-1203 is DOWN: CRITICAL - Host Unreachable (10.68.17.83) [19:12:44] PROBLEM - Host tools-exec-1202 is DOWN: CRITICAL - Host Unreachable (10.68.17.49) [19:12:47] I’d prefer to kill, since bandwidth for copying is the bottleneck at the moment. [19:12:54] YuviPanda: safe to kill now or shall I wait? [19:12:56] andrewbogott: feel free to kill. [19:13:20] * andrewbogott is nostalgic for the ciscos [19:13:25] heh [19:14:00] PROBLEM - Host tools-mailrelay-01 is DOWN: CRITICAL - Host Unreachable (10.68.16.57) [19:14:15] YuviPanda: ok, everything is deleted. If you schedule new instances they’ll be scheduled somewhere not labvirt1005. [19:14:39] Thank you for rebuilding everything, this is going to save a bunch of time [19:15:25] andrewbogott: :) cool! [19:15:32] andrewbogott: should I start rebuilding now or wait? [19:15:45] I’m not sure. [19:15:53] Wait, unless tools is blocked for want of exec nodes. [19:16:02] YuviPanda: any objections against rebuilding mail with precise for now? that also separates the question 'is everything puppetized' from 'does it work on trusty' [19:16:12] valhallasw`cloud: yeah, I agree. [19:16:24] andrewbogott: none deleted atm were pooled, so it's ok. [19:16:35] andrewbogott: I want to rebuild at least tools-static-02. [19:16:39] sure, go ahead [19:17:07] andrewbogott: but all the current exec nodes have puppet disabled on them because there are destructive-ish changes live. I want to get on it sooner than later - puppet runs on a fresh exec instance take like hours [19:18:10] YuviPanda: There’s no real reason not to start building them now, other than my being paranoid. [19:18:21] The sooner you kill the old exec nodes the sooner that disk space is freed up. [19:18:24] yeah [19:18:43] So go ahead and build ‘em if that's next on your list. [19:19:59] andrewbogott: alright. [19:20:20] andrewbogott: what’re we going to do about the new nodes? scream at vendor?
[19:20:20] err [19:20:22] new node [19:20:39] YuviPanda: scream at vendor, replace dimms, approach with caution. [19:20:45] alright [19:20:57] andrewbogott: we have enough room left even without that, right? [19:21:06] maybe! [19:21:08] Coren: sorry I didn’t see your swap patch and wrote my own :| [19:21:12] haha :) [19:21:39] o_O. I'm pretty sure I told you I was about to write it, and pointed it at you when I was done. :-P [19:22:58] They are suspiciously similar. :-) [19:23:29] Coren: yeah :| I forgot the mount, saw yours and then wrote the mount [19:23:47] Coren: I saw you mention that you were going to write it, and I think you added me as reviewer before you left but I didn't see the gerrit queue until it was too late [19:23:49] sorry about that [19:24:00] Things happen. :-) [19:28:46] !log tools recreated tools-static-02 [19:28:51] Logged the message, Master [19:30:07] !log tools set appropriate classes for recreated tools-exec-12* nodes [19:30:11] Logged the message, Master [19:32:55] Coren: re: the labs NFS puppet change, paravoid / bblack (I think those were the ones who -1’d the bash script?) should take a look maybe? I still do not feel comfortable enough with NFS / lvm snapshots to +1 it [19:33:17] !log re-created tools-mailrelay-01 with precise: [[Nova_Instance:i-00000bca.eqiad.wmflab]] [19:33:18] Coren: same for the ldap change too [19:33:18] !log tools re-created tools-mailrelay-01 with precise: [[Nova_Instance:i-00000bca.eqiad.wmflab]] [19:33:32] re-created is not a valid project. [19:33:36] Logged the message, Master [19:33:43] and I forgot to copy an s. Oh well. [19:34:11] andrewbogott: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Virtualization%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false has virt1001 spiking (just a fyi) [19:34:11] it also doesn't exist. What did I break this time >_< [19:34:19] valhallasw`cloud: what doesn’t exist... [19:34:36] oh, it's nova resource [19:34:37] that's why [19:34:40] derp. [19:34:59] ah heh :) [19:35:39] YuviPanda: maybe because something is being copied over? We’ll see if it settles. [19:35:44] andrewbogott: ok [19:45:02] YuviPanda: and this puppet run was done in a few minutes. What the heck? [19:45:19] okay, well, we have a new mail host then, I guess [19:45:40] valhallasw`cloud: presumably... [19:45:49] valhallasw`cloud: how to test? [19:45:54] ...or the puppet master is just slow [19:46:19] PROBLEM - Puppet failure on tools-mailrelay-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:49:02] PROBLEM - Puppet failure on tools-exec-1201 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:51:58] PROBLEM - Puppet failure on tools-static-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [19:52:45] YuviPanda: hum, there seems to be something wrong with the gridengine manifest. gridengine-common_6.2u5-4_all.deb tries to unpack in /var/lib/gridengine, but that's our shared NFS mount so that breaks [19:53:11] hurr durr. I wonder if that’s a first run issue? [19:53:12] Coren: ^ [19:53:33] seems to be consistent on further puppet -tv runs, even after a manual apt-get remove gridengine-common [19:53:51] * Coren thinks. [19:53:54] Ooooo. [19:53:57] idmap. [19:54:14] it's trying to chown to an inconsistent uid. [19:54:25] probably, yeah [19:54:30] error setting ownership of `/var/lib/gridengine/utilbin.dpkg-new': Invalid argument [19:54:39] sorry, I should have mentioned that error immediately [19:54:54] * Coren ponders. [19:55:58] Heh.
Spent so much time tracking down possible issues in other projects I never considered gridengine. [19:57:30] wait, sgeadmin has an ldap uid expressly for that reason. [19:57:41] what is it trying to chown /to/? [19:58:01] I'm not sure how to check; it's apt-get. [19:58:08] Lemme try by hand. What's the node? [19:58:09] it's in gridengine-common_6.2u5-4_all.deb [19:58:15] tools-mailrelay-01 [19:59:04] valhallasw`cloud: Coren it’s affecting the new nodes too [19:59:31] YuviPanda: I'd expect anything that tries a first-time install of gridengine-common [19:59:38] * Coren tries to see what apt is trying to do. [20:00:10] yeah [20:00:10] these single cpu hosts are horrible :/ [20:00:31] valhallasw@tools-mailrelay-01:~$ dpkg -c /var/cache/apt/archives/gridengine-common_6.2u5-4_all.deb | grep utilbi [20:00:32] drwxr-xr-x root/root 0 2012-01-24 20:46 ./var/lib/gridengine/utilbin/ [20:00:38] so, root, I guess? [20:01:18] unless it's some manual post install step [20:02:52] chown("/var/lib/gridengine/utilbin.dpkg-new", 0, 0) [20:02:59] That's not supposed to fail. [20:03:33] oh, you went for the strace option. Smart. [20:03:49] Hm. That's oddd. [20:04:35] What release is this? [20:05:06] precise [20:05:40] It looks like it's actually trying to do idmap and failing. [20:05:51] Which it shouldn't - I see the module option is there. [20:07:29] PROBLEM - Puppet failure on tools-exec-1203 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:07:53] ... but it's not really. [20:07:57] dafu? [20:08:32] Oh, wait, stupid ordering in puppet? Mind if I reboot the instance? [20:08:58] Coren: go ahead [20:09:33] aha. [20:10:03] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1246872 (10valhallasw) 3NEW [20:10:08] That was it. Stoopid puppet actually mounts the NFS filesystems before the disable has a chance of being put in place. [20:10:24] Reboots will fix - I will make a puppet patch. [20:10:51] * valhallasw`cloud loves how no-one understands the order in which puppet does things [20:11:01] valhallasw`cloud: I understand it perfectly. [20:11:29] "Completely random, with a dose of Murphy (if an order can screw you, it will be done in that order)" [20:11:58] RECOVERY - Puppet failure on tools-static-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:12:05] maybe I should rephrase that to 'no-one really gets what the effective dependencies in a real-life system are'? :P [20:14:00] RECOVERY - Puppet failure on tools-exec-1201 is OK: OK: Less than 1.00% above the threshold [0.0] [20:14:33] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1246893 (10valhallasw) Need to think of an actual test plan. This should include the following mail **sources**: - receiving mail from an external host -- SMTP in and check delivery - delivering local mail - deliver... [20:15:56] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:16:53] PROBLEM - Puppet failure on tools-exec-1208 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [20:17:49] on tools-bastion-02, i get timeouts when connecting to http://tools-static.wmflabs.org/ - is that intentional/permanent? [20:18:17] afeder: ah, that’s something that needs to be fixed. let me do that now.
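(How the failing call above was likely pinned down: run the package installation under strace and watch the ownership syscalls. The dpkg invocation is an assumption; the .deb path and the failing chown line are straight from the log.)

```bash
# Reinstall the package under strace, tracing only ownership changes:
strace -f -e trace=chown,fchown,lchown \
    dpkg -i /var/cache/apt/archives/gridengine-common_6.2u5-4_all.deb 2>&1 |
    grep /var/lib/gridengine
# On the broken instances this shows chown(..., 0, 0) on
# /var/lib/gridengine/utilbin.dpkg-new failing with EINVAL over NFS.
```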
[20:21:19] RECOVERY - Puppet failure on tools-mailrelay-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:25:52] woo, someone changed something and the instance i'm running tests against got like 10x faster :) [20:31:00] valhallasw`cloud: Applied. This'll only apply to new instances though - those you created in between need a reboot. [20:31:17] booo [20:31:45] Coren: do you think you can take over puppet-running and pooling tools-exec-12**? there are 10 instances that are running puppet but need a reboot now because of this, I guess. [20:31:51] Coren: thanks [20:32:02] I want to get some availability metrics stuff done before end of week.. [20:32:04] Hm... I thought that it was just "copy-paste + search-replace" to support new wikis.... https://phabricator.wikimedia.org/T76939 :-D [20:32:25] Coren: if you’re busy with the NFS stuff (or anything else) I can do this no problems tho [20:33:06] YuviPanda: Can do; it's a one-liner. :-) [20:33:32] Coren: cool :) puppet is failing there because of the nfs issue tho [20:33:34] YuviPanda, Coren, could either of you glance over the test plan @ https://phabricator.wikimedia.org/T97574#1246893 ? [20:33:38] so need to verify they have working puppet as well [20:34:04] valhallasw`cloud: lgtm. you probably need one of us to set the MX domain record tho [20:34:06] for testing [20:34:09] -02 [20:34:26] *nod* [20:34:41] valhallasw`cloud: also, I think making it work on trusty and precise at same time might be too hard - if so, you can just test on toolsbeta and then switch over tools when you’re convinced it works [20:35:03] YuviPanda: I think we have to because we're not the only exim in town [20:35:13] YuviPanda: All 10 rebooted [20:35:59] valhallasw`cloud: have to for my suggestion or ‘both precise and trusty’? [20:36:25] YuviPanda: I think we need to keep precise working because there's other people who use the exim class [20:37:08] and it shouldn't be hard, really, just a few ifs [20:37:16] but it's puppet so probably hard, yeah. :-p [20:37:45] oh, and I need to figure out how to get access to puppet-compiler. Hashar, probably? [20:38:15] valhallasw`cloud: we shouldn’t mess around with the exim class, no? Just the tools relay one... [20:38:42] YuviPanda: I think we should first merge scfc's patch, then work on trusty [20:43:20] YuviPanda: when we have puppet-compiler, we can also check no changes will be deployed on the precise host! \o/ [20:43:41] valhallasw`cloud: I think making tools-beta better might be much easier and also more worthy. [20:43:54] YuviPanda: not really. [20:43:59] valhallasw`cloud: puppet compiler is going to give you nothing but pain and a million shaving yaks. [20:44:07] YuviPanda: and why exactly? [20:44:18] it tells you exactly how the catalog for a specific host changed [20:44:23] it’s closer to static analysis than actually checking. [20:44:57] valhallasw`cloud: it fails non-deterministically a lot of times. ori was fighting with it yesterday about how it sometimes doesn’t know what OS the target host is... [20:45:23] anyway, it’s just personal experience - I don’t have enough experience with it to tell you why it sucks, but general chatter in -operations and other channels is that ‘eugh, really?' [20:45:56] YuviPanda: yeah, but the alternative is deploying on toolsbeta, and that /really/ is 'eugh'.
[20:46:14] I think if you use puppetcompiler enough you’ll change your opinion :) [20:46:20] and then you have to parse puppet agent -tv output to sort-of guess that it's doing the right thing [20:46:37] well, only because toolsbeta isn’t exactly the same as tools [20:46:53] ? [20:46:55] anyway, in an ideal world, puppet compiler would be super useful and great, no argument there. [20:47:04] just saying we’re far from there. And good luck :) [20:47:18] yeah, and in an ideal world deploying a change to toolsbeta would not take multiple minutes ;-) [20:48:54] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247026 (10d33tah) Hello, It's been 10 days since I asked for that. Could I please ask for an update on the review process? [20:49:52] oh, this is an issue. Notice: /Stage[main]/Toollabs::Mailrelay/File[/data/project/.system/store/mail-relay]/content: [20:50:05] so the two relays are going to fight over who's the relay [20:50:19] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247028 (10yuvipanda) Heya! sorry this got lost a bit - lots of other firefighting going on. So the biggest instance you will be able to get on labs has 160GB instance storage, so I'm not sure if it'll be big enough for you. Also, why do y... [20:50:28] valhallasw`cloud: oh ouch :| [20:50:31] thcipriani: is deployment-pdf01 ailing? I can’t ssh but maybe that’s normal due to a security rule or something [20:50:49] * thcipriani looks [20:51:21] Coren: cool! (re: reboot). Can you also start the drain / repool? And !log so I / others know what’s happening? Thanks! [20:51:32] YuviPanda: hmm. let's see how we can quickfix this... [20:51:55] valhallasw`cloud: I suggest an ‘active_mailhost’ param or something. we use something similar for active-proxy [20:51:57] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247039 (10d33tah) If I can compress the database, this can probably work. As for rDNS - I can look up the newly-created edits, but for the historical ones I have no option other than correlate them with an existing database. [20:52:01] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1247040 (10scfc) (If you find something usable for a "test suite", please report here. In a few weeks, I need to set up my own personal mail server, and I would really like to have some script(s) that I can run to be s... [20:52:06] YuviPanda: that's not really a quickfix :P [20:52:12] alright [20:52:35] super quick would be to disable the mailrelay on tools project and test on toolsbeta :) [20:52:48] * YuviPanda goes for meeting [20:53:15] YuviPanda: oh, just uncheck the puppet class. gotcha. [20:53:40] valhallasw`cloud: yeah. or even ‘sudo puppet agent --disable' [20:54:10] YuviPanda: *nod* done [20:54:27] YuviPanda: I'm almost done for the day, and tomorrow is going to be a stressful one. Can this wait for tomorrow morning? [20:54:28] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247047 (10d33tah) (if I can compress = if I will be able to run a compressed filesystem, either by having root on the VM or asking the admin to help me set it up) [20:54:31] andrewbogott: seems normal, looks like it's been having trouble with puppet runs for a while. Latency seems normal, too. [20:54:33] Coren: oooh, sure. [20:54:41] Coren: sorry, my TZ map for you is still messed up. [20:54:44] thcipriani: great. [20:54:50] Coren: thanks! [20:54:51] thcipriani: mind if I suspend it for a few minutes?
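(The quick fix valhallasw`cloud applied above, spelled out; the `sudo puppet agent --disable` command is quoted from the log, and the reason string is optional but shows up when someone else tries to run the agent.)

```bash
# Stop scheduled puppet runs while the two relays would otherwise fight:
sudo puppet agent --disable 'two mailrelays fighting; testing on toolsbeta first'
# ...and once it is safe again:
sudo puppet agent --enable
sudo puppet agent -tv
```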
[20:54:51] :-) [20:55:06] (translation: If it’s not broken, I’m’a break it) [20:55:10] YuviPanda: FYI, looks like -1201 isn't actually configured as a node. All other 9 seem to be. [20:55:19] andrewbogott: heh, sure, go for it :) [20:55:27] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247048 (10jeremyb) >>! In T96512#1247047, @d33tah wrote: > either by having root on the VM or [...] root is no problem [20:57:17] thcipriani: done. Is it…still working as before? [20:58:30] andrewbogott: yes, everything looks the same. In fact, I didn't lose my ssh session(?) [20:58:41] cool, that’s the idea :) [20:58:44] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1247064 (10valhallasw) tools-mail and tools-mailrelay-01 are now fighting over who's the current relay (/data/project/.system/store/mail-relay). I disabled puppet on tools-mailrelay-01 for now. We should probably merge... [20:59:08] thcipriani: so… is it ok if I subject other deployment-prep instances to a similar suspension? Some might be a bit longer, none should be especially long. [20:59:13] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247067 (10d33tah) I'm interested in trying to run it under 160GiB then. [20:59:16] (That pause just saved us 29g of disk space) [21:00:07] wowza, that's a big savings, sure: go for it, that didn't seem particularly disruptive. [21:00:13] Was something changed for tools-login? Can't login anymore ("Algorithm negotiation failed."). Which encryption algorithms are allowed? [21:00:55] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247078 (10yuvipanda) 5Open>3Resolved a:3yuvipanda I've created the project and added you as admin :) Remember that the /data/project and /home mounts on instances are NFS, and should *not* be used for any heavy lifting. [21:00:56] 6Labs, 7Tracking: New Labs project requests (Tracking) - https://phabricator.wikimedia.org/T76375#1247081 (10yuvipanda) [21:02:01] YuviPanda: thanks! [21:02:09] d33tah: yw! [21:02:15] d33tah: sorry it took so long. [21:02:27] sure thing. i decided to ping you guys after 10 days [21:02:37] since i took a quick look at how long it usually takes and it was the max i think [21:03:06] d33tah: yeah [21:03:26] !log deployment-prep suspending and shrinking disks of many instances [21:03:31] Logged the message, dummy [21:03:57] now, i need to figure out how to connect to the vm ;) [21:04:02] YuviPanda: can you restart the bot in 'meetbot'? [21:05:31] apper: hello, i was involved in the sshd change, what's up [21:05:38] tested that on all 4 distro versions we use [21:05:52] interesting - bastion rejected my ssh key [21:06:05] ah [21:06:10] wrong ~/.ssh/config [21:08:27] mutante: I'm using an old version of SSH Secure Shell and I've got this message from tools-login ("Algorithm negotiation failed."). Key exchange failed. I can choose an algorithm (AES, Blowfish, CAST, ...) but don't know which will work [21:10:18] apper: do you set the "Ciphers" option at all in your config? [21:10:19] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1247107 (10valhallasw) Actually, I think that fighting was a good server test on itself ;-) A few excerpts from the log file: ``` 2015-04-29 19:33:08 1YnXjI-0000C1-5i "root@tools-mailrelay-01.eqiad.wmflabs" from env-fr...
[21:10:41] apper: so here is the actual change: https://gerrit.wikimedia.org/r/#/c/185325/5/modules/ssh/templates/sshd_config.erb [21:10:49] YuviPanda: i'm a bit confused - you say you granted me a project, but i didn't get any hostname or credentials... is there some article/section i should rtfm? [21:11:03] 10Tool-Labs: Provision and test tools-mailrelay-01 - https://phabricator.wikimedia.org/T97574#1247108 (10valhallasw) [21:11:13] apper: see how we already set it differently for older distro versions [21:11:38] apper: can you select AES then? [21:11:41] sorry am in a meeting [21:11:43] brb [21:12:56] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247111 (10d33tah) Thank you! I am a bit lost though - I didn't receive any hostname or credentials. Is there some article/section that I missed and I should have read? I flicked through the Wiki, especially "Help:" pages, but I could have... [21:13:10] sure, sorry [21:13:23] apper: try setting Ciphers aes256-ctr,aes192-ctr,aes128-ctr in client config [21:15:08] mutante: hmm.. i only have "aes256,aes192,aes128" as config options, without ctr [21:15:50] mutante: I think I have to choose a new client.... [21:16:09] mutante: this one seems to be outdated [21:17:07] d33tah: you need to create an instance via wikitech, see https://wikitech.wikimedia.org/wiki/Help:Instances [21:17:30] sitic: thanks, will do [21:19:43] apper: alternatively, try connecting through bastion1.wmflabs.org, which still runs ubuntu 12.04 [21:20:23] i can see 6 instances [21:20:36] oh wait, that has a more limited list of ciphers even [21:21:13] apper: putty, then? [21:21:35] valhallasw`cloud: yes I will try it [21:22:22] valhallasw`cloud: I tried it several times in the last 10 years and hated it every time, but I will try again ;) [21:24:31] 6Labs, 3Labs-Q4-Sprint-2, 3Labs-Q4-Sprint-3, 3Labs-Q4-Sprint-4, 3ToolLabs-Goals-Q4: Schedule reboot of all Labs Precise instances - https://phabricator.wikimedia.org/T95556#1247126 (10Ricordisamoa) [21:25:07] apper: what is your version btw? [21:25:29] sitic: hm... i can see the six instances, but can't create my own [21:25:58] mutante: SSH secure shell? 3.2.9 [21:25:59] apper: it doesn't work with the aes256 ? [21:26:06] mutante: nope [21:26:11] hrmmm [21:27:38] valhallasw`cloud: apper: i think bastion1.wmflabs.org would be the same issue then though [21:27:58] so if upgrading your client seems an option (putty) that would be easiest [21:28:42] putty works [21:29:10] What I liked about ssh secure shell is that I can use one program for console and file transfer [21:29:18] d33tah: your project currently has no instances: https://wikitech.wikimedia.org/wiki/Nova_Resource:Wikispy (it should look like this https://wikitech.wikimedia.org/wiki/Nova_Resource:Abusefilter-global when you have created an instance) [21:29:26] but I can use putty+something else...
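(mutante's suggestion above, written out as an OpenSSH client config stanza; the host pattern is illustrative. This only helps clients that actually implement the -ctr ciphers, which apper's 2003-era SSH Secure Shell does not, hence the switch to PuTTY below.)

```bash
# Pin the ciphers the servers still accept, for one host only:
cat >> ~/.ssh/config <<'EOF'
Host tools-login.wmflabs.org
    Ciphers aes256-ctr,aes192-ctr,aes128-ctr
EOF
# Verify what actually gets negotiated:
ssh -v tools-login.wmflabs.org 2>&1 | grep -i cipher
```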
[21:30:18] d33tah: I'm only a tools user, but I think you can create instances via configure on https://wikitech.wikimedia.org/wiki/Special:NovaProject (not sure though, as I can't do this) [21:30:49] apper: bitvise ssh can do that as well, but its terminal is meh-meh [21:30:58] apper: yea, sorry about that, but it's probably good if you upgrade since it's about removing old weaker ciphers [21:30:59] apper: I typically use putty + winscp [21:31:50] sitic: nope, doesn't look like [21:32:25] apper: another option is maybe ssh in cygwin [21:32:26] valhallasw`cloud: ^ d33tah can't find the link on wikitech to create an instance for his new project [21:32:30] mutante: yes, I think it's really a good idea to upgrade... 3.2.9 is from 2003... [21:32:47] apper: actually, winscp + putty should work for you. Open winscp, log in to site (gives you a file transfer window) [21:32:48] I'll try putty+filezilla [21:32:56] apper: then click 'open session in putty' to get a shell [21:33:04] apper: ok :) glad you have an alternative, those global changes to all servers are always finding a balance somehow [21:33:25] valhallasw`cloud: ah cool, then I will give winscp a try [21:33:52] is there a chance to have clickable hyperlinks on putty? [21:34:02] d33tah: it's the tiny 'add instance' in https://wikitech.wikimedia.org/wiki/Special:NovaInstance [21:34:05] apper: there was/is "nutty" for that [21:34:13] I have some scripts which just output wikipedia urls... [21:34:17] it's like putty just with clickable URLs [21:34:17] apper: with puttytray or nutty, yes [21:34:24] but unfortunately nutty is outdated too [21:34:29] afaict [21:34:46] https://puttytray.goeswhere.com/ [21:35:11] so my 12 year old ssh secure shell was able to make hyperlinks clickable, but a current version of putty isn't? :/ [21:39:06] starting putty from winscp works well, but I think I have to change my workflow regarding hyperlinks ;) [21:39:16] Thanks, valhallasw`cloud and mutante [21:40:10] apper: yw. you could still try if it works .. http://groehn.net/nutty/ [21:40:17] apper: puttytray puttytray puttytray [21:40:37] what valhallasw`cloud says [21:40:40] "Nutty hasn't been in active development since 2005. However the patch has been included in other PuTTY modifications such as PuTTY Tray. " [21:40:43] same patch :) [21:41:09] so puttytray or nutty? ;) [21:41:43] I'll try puttytray [21:41:45] puttytray [21:42:10] apper: the latest version even lists all links in the context menu :O [21:43:26] https://i.imgur.com/O1rqQFK.png [21:43:48] (they are duplicate because I echo'ed them, but that's hidden behind the context menu :p) [21:43:59] and you can also just click them, obviously [21:46:26] valhallasw`cloud: is it possible to find with ctrl+f? if so, it's not there [21:46:26] thanks [21:46:41] but it seems to be very buggy.... http://www.test.com works, http://www.test.de not... [21:46:42] d33tah: you might need to add your project to the projects list [21:46:47] in the textbox above [21:46:56] for the second one only http://www is clickable... [21:47:18] but for wikipedia links it should work, so thanks! [21:47:22] odd. there's a regex under window>hyperlinks [21:47:33] you can choose 'select nearly any' instead [21:47:41] valhallasw`cloud: ah crap, it's a usability nightmare [21:47:46] d33tah: yes, it is [21:48:06] it's not exactly virtualbox-style virtualization [21:48:19] okay, i can see [21:48:55] anyway [21:48:58] valhallasw`cloud: thanks, that works!
[21:49:07] i can't find the button even though i can see "wikispy" on the list [21:49:52] d33tah: hm. [21:50:00] d33tah: https://wikitech.wikimedia.org/wiki/Help:Instances is the overall guide [21:50:08] d33tah: try logging out and in again [21:50:55] valhallasw`cloud: that helped, thanks! [21:52:59] I'm off to bed; good luck with the instance creation [21:54:23] 6Labs: Create WikiSpy project - https://phabricator.wikimedia.org/T96512#1247257 (10d33tah) Update: valhallasw`cloud helped me create an instance. [21:56:14] 6Labs, 10Labs-Infrastructure, 6operations, 10ops-eqiad: labvirt1005 memory errors - https://phabricator.wikimedia.org/T97521#1245178 (10Andrew) All instances are now migrated off of labvirt1005 -- Chris, you can do whatever you need to fix this box; I'm going to re-image it before putting it back to work. [21:57:52] andrewbogott: the shinken storm seems to have subsided in -releng, pausing completed? [21:58:24] thcipriani: yes, all done. [21:58:35] cool, thanks! [21:59:18] Coren_away: YuviPanda: labvirt1005 is now instance-free. The rest of the cluster is not yet groaning under the load, so I think we’re OK for the moment. [21:59:54] Why do we have so many hardware issues? [22:01:18] Krenair: i wonder if we are above average, actually. maybe it's just a matter of scale, we recently went over 1000 servers [22:01:38] Krenair: or it's only virt* boxes because Cisco (vs. Dell) [22:01:53] needs stats? [22:01:55] Wikimedia Labs | Status: stable. More partial interruptions coming next week | tools-dev has new OS and fingerprint: https://lists.wikimedia.org/pipermail/labs-announce/2015-April/000009.html | https://www.mediawiki.org/wiki/Wikimedia_Labs | Channel logs: https://bit.ly/11GZvbS | Open bugs: http://bit.ly/1l2wFhO | Admin log: http://bit.ly/ROfuY5. [22:02:29] Krenair: We’ve had quite a few issues with the newest line of servers we’ve bought. I don’t know why they’re so error-prone. [22:02:36] Oi, that "open bugs" link is old [22:02:59] Hm, so it is. I’ll just remove it, the topic is a bit wordy anyway [22:03:05] :) [22:03:33] how about https://phabricator.wikimedia.org/tag/labs/ [22:04:04] mutante: is it useful? [22:06:06] andrewbogott: yes, but only if we use workboards in general [22:07:01] i dunno, it just seemed the simplest overview.. nevermind [22:20:17] mutante: Open tickets may be better https://phabricator.wikimedia.org/maniphest/?statuses=open%2Cstalled&allProjects=PHID-PROJ-msyn2z45n7mw45bfuscb#R [22:20:38] From "Open Tasks" in the sidebar on #labs [22:25:37] YuviPanda: I was shrinking instances and I may have damaged wdq-mm-02 — can you check and see if it’s OK? [22:26:52] ah, nevermind, it seems to be there, and fine. [22:33:23] andrewbogott: ah, ok :) [22:33:45] YuviPanda: I shrank everything except for the tools nodes. I’ll wait until you’re done rebuilding and then see what’s left. [22:34:11] It turns out that suspending/shrinking works pretty well. So it might make sense (with the rest of the migration) to live-migrate, suspend, shrink, resume [22:34:15] rather than cold migrate? [22:34:27] I can’t decide if it’s worth going to such extremes to avoid instance reboot. [22:35:20] andrewbogott: I think instances should survive reboot. [22:35:25] yeah [22:35:46] andrewbogott: so cold migrate sounds nicer :D [22:35:54] Probably! [22:36:06] Oh, except cold-migrate also changes private IPs, which is not so nice. [22:36:09] ouch [22:36:10] anyway, I’m out for tonight.
We’ll see if I make it to tomorrow morning without another page [22:36:13] yes that’s not nice, yeah [22:36:16] andrewbogott: :) thank you! [22:49:36] PROBLEM - Puppet failure on tools-webgrid-06 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [0.0] [22:50:44] PROBLEM - Puppet failure on tools-webgrid-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:50:58] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:51:28] PROBLEM - Puppet failure on tools-webgrid-05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [22:51:43] fine, shinken-wm fine. [22:51:46] PROBLEM - Puppet failure on tools-webproxy-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:51:52] PROBLEM - Puppet failure on tools-mail is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [22:52:52] PROBLEM - Puppet failure on tools-bastion-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [22:53:06] PROBLEM - Puppet failure on tools-bastion-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [22:54:02] PROBLEM - Puppet failure on tools-static-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:54:35] PROBLEM - Puppet failure on tools-webgrid-generic-01 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [22:55:13] PROBLEM - Puppet failure on tools-dev is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [22:55:53] PROBLEM - Puppet failure on tools-trusty is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [22:58:50] PROBLEM - Puppet failure on tools-webgrid-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [23:16:47] RECOVERY - Puppet failure on tools-webproxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:19:31] RECOVERY - Puppet failure on tools-webgrid-06 is OK: OK: Less than 1.00% above the threshold [0.0] [23:19:37] RECOVERY - Puppet failure on tools-webgrid-generic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:20:45] RECOVERY - Puppet failure on tools-webgrid-02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:20:53] RECOVERY - Puppet failure on tools-trusty is OK: OK: Less than 1.00% above the threshold [0.0] [23:20:57] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0] [23:21:29] RECOVERY - Puppet failure on tools-webgrid-05 is OK: OK: Less than 1.00% above the threshold [0.0] [23:21:49] RECOVERY - Puppet failure on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [23:22:52] RECOVERY - Puppet failure on tools-bastion-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:23:06] RECOVERY - Puppet failure on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [23:23:50] RECOVERY - Puppet failure on tools-webgrid-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:24:02] RECOVERY - Puppet failure on tools-static-01 is OK: OK: Less than 1.00% above the threshold [0.0] [23:25:14] RECOVERY - Puppet failure on tools-dev is OK: OK: Less than 1.00% above the threshold [0.0] [23:36:46] 10Tool-Labs: Audit redis usage on toollabs - https://phabricator.wikimedia.org/T91979#1247746 (10Dfko) We're up to about a gigabyte, can I get a dump now? 
[23:43:18] 10Tool-Labs, 3ToolLabs-Goals-Q4: Phase out precise instances from toollabs - https://phabricator.wikimedia.org/T94790#1247787 (10yuvipanda) [23:43:19] 10Tool-Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Make webservice default to trusty on toollabs - https://phabricator.wikimedia.org/T94788#1247784 (10yuvipanda) 5Open>3Resolved a:3yuvipanda DONE! [23:43:38] 10Tool-Labs, 5Patch-For-Review: webservice start / restart shouldn't always need to specify type of server - https://phabricator.wikimedia.org/T97245#1247788 (10yuvipanda) 5Open>3Resolved a:3yuvipanda