[00:16:43] maplebed: my cron spam has actually tripled [00:16:52] bah! [00:17:04] new error tho [00:17:06] Failed to instantiate parser (line 197): 'module' object has no attribute 'SwiftProxyLogtailer' [00:17:19] hrmph. [00:17:22] I did 2>/dev/null [00:17:30] but I guess it's sending shit to stdout, not just stderr. [00:17:44] ok, ok, I'll send both to the great bit bucket in the sky. [00:17:56] excellent. output is for the weak anyways [00:18:39] New patchset: Bhartshorne; "shush!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2247 [00:18:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2247 [00:19:02] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2247 [00:19:03] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2247 [00:20:06] nimish_g: ok, wait past the next 5m boundary then please let me know if it's still going. [00:20:19] k [00:51:36] New patchset: Ottomata; "Scrapped Variable class." [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2248 [00:51:38] New patchset: Ottomata; "Removing directories. A bit cluttery, I'll re-add when needed. Also,, backend is not a python package" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2249 [00:51:39] New patchset: Ottomata; "Mm, feeling good! AccessLogPipeline now able to be used without extending the class. Mmmm, prettier interface!" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2250 [00:53:17] New patchset: Ottomata; "Removing unused user_agent1.py file. user_agent.py is left around for historical purposed until I feel ready to remove it." 
[analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2251 [01:05:35] New patchset: Ottomata; "Meant to commit access_log.py before" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2252 [01:05:36] New patchset: Ottomata; "pipeline/base.py - adding documentation" [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2253 [01:59:57] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2249 [02:00:21] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2248 [02:00:21] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2249 [02:00:22] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2248 [02:00:39] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2250 [02:00:40] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2250 [02:01:08] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2251 [02:01:08] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2251 [02:01:31] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2253 [02:01:50] New review: Diederik; "Ok." [analytics/reportcard] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2252 [02:01:51] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2253 [02:01:51] Change merged: Diederik; [analytics/reportcard] (master) - https://gerrit.wikimedia.org/r/2252 [03:06:21] RECOVERY - Disk space on srv219 is OK: DISK OK [03:06:21] RECOVERY - Disk space on srv223 is OK: DISK OK [03:52:41] anybody know if someone else fixed nagios or if it fixed itself? 
[03:53:23] !log moved all the individual puppet files out of place, stopped nagios, and re-ran puppet (at now minus 1.5hrs) [03:53:26] notpeter said he was fixing it [03:53:33] Logged the message, Master [03:53:55] maplebed: I did some stuff [03:54:00] and ran puppet a couple of times [03:54:04] and it didn't rebreak [03:54:07] ok. [03:54:10] so... we'll call it a win? [03:54:26] I moved all of /etc/nagios/puppet_checks.d/* to puppet_checks.d/trash and re-ran puppet, [03:54:31] but left before the puppet run finished [03:54:35] so I don't know what it actually did. [03:54:41] so, puppet created good stuff [03:54:57] the cp1002.cfg didn't have anything pathological in it [03:55:12] but the presence of the trash dir was causing nagios to not be able to start [03:55:18] so I rmed it [03:55:19] oh lame. [03:55:21] thanks. [03:55:22] and nagios ran fine [03:55:37] next time I'll put it in /tmp/ or something, but I wanted it to be obvious to anybody else poking [03:55:43] (which I guess it was, since you found it!) [03:55:49] thank you! it seems that you're the one that did the fixing [03:55:58] good. well it looks like it'll be ok for a bit, [03:56:10] all the puppet_checks.d/* files were huge, and now they're small. [03:56:20] yeah [03:56:22] it's sweet [03:56:26] I wonder if we shouldn't just rm them once a day or once a week or something until we can figure out how not to add them to the files over and over again... [03:56:30] nagios starts up much much faster now [03:56:31] ok, bart's here; brb [03:56:37] oh, nevermind. wrong train. [03:56:38] kk [03:56:46] yeah, i have thought about that as well [03:56:51] (hooray for wifi on my phone!) [03:56:52] :D [03:57:13] it would be much better to fix it for real, [03:57:35] but seriously, this happens so often and we don't have anything successfully watching nagios yet. [03:57:42] actually, that's probably the more important thing to do. 
[03:57:51] get watchmouse to watch nagios in a way that it'll alert when this happens [03:58:07] (this time, ct just pinged me; dunno how you started in on it.) [03:58:46] he pinged me as well [03:58:52] but yeah, that would be a good plan [03:58:55] I can set that up [03:59:01] just a string check for the nagios fail page [03:59:35] that should work. [03:59:57] yeah. I'll make a ticket. whiskey doesn't inhibit RT :) [04:00:44] rt 2379 [04:00:45] just for you. [04:00:46] :D [04:00:53] thanks! [04:00:53] mmmm.... [04:00:58] whiskey... [04:01:06] (almost home!) [04:03:13] maplebed: excellent. have a good night! [04:03:24] thanks! you too. [04:16:20] RECOVERY - Disk space on es1004 is OK: DISK OK [04:21:10] RECOVERY - MySQL disk space on es1004 is OK: DISK OK [04:43:55] PROBLEM - MySQL slave status on es1004 is CRITICAL: CRITICAL: Slave running: expected Yes, got No [09:11:23] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [09:11:23] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [09:11:23] PROBLEM - Puppet freshness on mw65 is CRITICAL: Puppet has not run in the last 10 hours [09:45:38] PROBLEM - Puppet freshness on knsq9 is CRITICAL: Puppet has not run in the last 10 hours [09:54:18] PROBLEM - Puppet freshness on ms-fe1 is CRITICAL: Puppet has not run in the last 10 hours [10:03:38] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=64%): /var/lib/ureadahead/debugfs 0 MB (0% inode=64%): [10:06:32] New patchset: J; "add timedmediahandler files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2254 [10:08:28] PROBLEM - Disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 451676 MB (3% inode=99%): [10:17:58] PROBLEM - MySQL disk space on es1004 is CRITICAL: DISK CRITICAL - free space: /a 414417 MB (3% inode=99%): [10:26:58] RECOVERY - Disk space on srv219 is OK: DISK OK [11:38:01] New review: Dzahn; "(no comment)" 
[operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2216 [11:38:02] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2216 [12:12:07] New patchset: Mark Bergsma; "Revert "squid class not getting included for some reason. maybe this is a workaround?"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2255 [12:12:39] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2255 [12:12:52] Change abandoned: Mark Bergsma; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2255 [12:14:47] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2226 [12:14:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2226 [12:17:02] New patchset: Mark Bergsma; "Test what was up with include squid" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2256 [12:17:21] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2256 [12:18:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2256 [12:18:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2256 [12:21:12] New patchset: Mark Bergsma; "Fully qualify manifests/swift.pp includes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2257 [12:21:38] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2257 [12:21:39] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2257 [12:39:01] New patchset: Mark Bergsma; "Create a [volatile] puppet fileserver module for volatile files" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2258 
[12:39:43] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2258 [12:39:44] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2258 [13:16:19] New patchset: Mark Bergsma; "Add support for squid config file serving by Puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2259 [13:17:23] New patchset: Mark Bergsma; "Add support for squid config file serving by Puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2259 [13:18:01] RECOVERY - Frontend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.202 seconds [13:21:41] RECOVERY - Backend Squid HTTP on cp1001 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.196 seconds [13:28:15] New patchset: Mark Bergsma; "Add support for squid config file serving by Puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2259 [13:32:37] mark, when do you think there will be time for a hour of enotif test as you said, now that mailman migration is over? [13:42:56] !log Disabled knsq1-15 in PyBal, preparing for decommissioning [13:42:59] Logged the message, Master [13:48:36] Reedy: how to use updateArticleCount.php? with "mwscript" or not? wikiversions.cdb has no version entry for `updateArticleCount.php`. [13:49:24] mwscript updateArticleCount.php enwiki [13:49:33] are you just running it on specific wikis? [13:49:39] yes [13:50:03] !bz 34184 [13:50:03] https://bugzilla.wikimedia.org/34184 [13:50:16] thanks, i am running it on vepwiki [13:50:28] "To update the site statistics table, run the script with the --update option." 
ok [13:50:49] sitestatesinit false [13:50:57] hmm we should see if [13:51:10] ah on something tiny, ok [13:51:11] I do like it when people tell us how to run maintenance scripts, and it's rather wrong [13:51:17] on a brand new one [13:51:22] yep [13:52:16] Running a lot of the scripts is a very grey area [13:52:31] no kidding [13:52:36] New patchset: ArielGlenn; "move kiwix mirror contents to public/other directory for mirroring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2260 [13:52:42] "fine on something tiny. will crash the site on anything bigger." [13:52:46] :D [13:52:51] Yup... [13:53:01] It's something (else) that really should be documented [13:53:11] yes [13:53:14] so much to document [13:53:16] so little [13:53:21] !log resetting stats on new wikis per bz 34184: updateArticleCount.php vepwiki --update; updateArticleCount.php pnbwiktionary --update [13:53:21] inclination... [13:53:22] Logged the message, Master [13:55:19] New review: ArielGlenn; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2260 [13:55:20] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2260 [13:55:59] apergos: yeah, i dont know about the "as always" part (" Statistics are broken after import (as always)") [13:56:34] me either [14:02:27] arg, ticket reopened already [14:03:22] " Statistics are broken after import (as always)" [14:03:37] wrong paste..this "And did you also run refreshCategoryCounts.php?" [14:05:02] The MediaWiki script file "... refreshCategoryCounts.php" does not exist. ..uhmpf [14:06:09] hah [14:06:53] uhm..yeah.. "There are articles on the wiki so it's not possible that there are 0 found." .. 
no idea so far [14:07:06] lol [14:07:19] mutante, said maintenance script is on bug 18488 [14:07:34] added in r66140 and reverted [14:07:39] ouch, i hope not in a way that using it breaks stuff more :) [14:07:44] Try populateCategory.php [14:07:54] SiteStatsInit remains patched out for now. [14:07:58] you know that right? [14:08:04] i didnt [14:08:06] unless there's an update since then [14:08:12] from tim's mail about the outage [14:08:18] oh.. [14:08:23] was that only yesterday? yes it was [14:09:16] ah, on list. got it now [14:09:21] uh huh [14:10:36] ok, so this explains why we get 0 articles [14:10:59] but is also unrelated to stats problem on newly created wikis [14:12:18] sigh, who wants to explain on BZ:) [14:12:29] notme [14:12:45] sorry but I have waaay more angst with the mirrors than I need [14:12:56] don't want to add to my plate today :-D [14:13:05] np,i will :) [14:39:02] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2259 [14:39:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2259 [14:43:00] New patchset: Mark Bergsma; "Make mount volatile readable for the puppetmaster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2261 [14:43:17] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2261 [14:43:36] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2261 [14:43:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2261 [15:13:11] New patchset: Mark Bergsma; "Don't run setup-aufs-cachedirs on squids that don't use aufs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2262 [15:14:54] New patchset: Mark Bergsma; "Don't run setup-aufs-cachedirs on squids that don't use aufs" [operations/puppet] (production) - 
https://gerrit.wikimedia.org/r/2262 [15:17:08] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2262 [15:17:08] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2262 [15:20:21] !log Around 14:50 UTC, removed the 3 remaining esams upload squids in the knsq8-15 range from the config. This made ms5 unhappy. [15:20:23] Logged the message, Master [15:24:12] New patchset: Mark Bergsma; "It's kinda useful to know that db40 is not "just" a core db" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2263 [15:24:33] hah [15:25:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2263 [15:25:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2263 [15:26:41] 120203 15:17:00 InnoDB: Warning: cannot find a free slot for an undo log. Do you have too [15:26:41] InnoDB: many active transactions running concurrently? [15:30:32] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2230 [15:30:32] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2230 [15:30:46] Probably best to get asher to look at it when he's here [15:40:43] New patchset: Dzahn; "enhance purge_all - area code API lookup one-liner :p" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2264 [15:42:01] New patchset: Dzahn; "enhance purge_all - area code API lookup one-liner :p" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2264 [15:44:04] New patchset: Dzahn; "enhance page_all - area code API lookup one-liner :p - option to skip an area" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2264 [15:44:21] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2264 [15:45:06] New patchset: Mark Bergsma; "Retab squid.xml, decommission knsq1-15" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2265 [15:45:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2265 [15:45:32] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2265 [15:45:32] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2265 [15:56:19] New patchset: Mark Bergsma; "Decommission knsq1-15" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2266 [15:56:36] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2266 [16:00:23] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2266 [16:00:24] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2266 [16:08:04] !log db41 being reinstalled, appears down but logging to be safe [16:08:06] Logged the message, RobH [16:13:39] RECOVERY - Host db41 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:15:09] PROBLEM - HTTP on ekrem is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:16:15] New patchset: Mark Bergsma; "Assign new ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2267 [16:16:32] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2267 [16:18:37] New patchset: Mark Bergsma; "Assign new ganglia aggregators" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2267 [16:19:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2267 [16:19:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2267 [16:19:52] hi roan [16:20:13] diederik: I am ordering your hard disks for the logging server right now =] [16:20:29] super sweet :D [16:20:33] thx [16:23:08] so rob [16:23:12] are they 3.5" or 2.5"? [16:23:44] 3.5 [16:23:51] the misc servers use the large drives [16:23:56] cool [16:25:32] diederik: so they should ship today, or monday at the latest, and its two day estimated delivery. I expect wednesday to install these for you and spin up the server [16:26:25] great! can you or mark also configure it as a proxy server that will do multicast? [16:26:39] yeah that's even already in puppet [16:26:48] awesome, awesome! 
[16:26:49] PROBLEM - Host db41 is DOWN: PING CRITICAL - Packet loss = 100% [16:38:19] RECOVERY - HTTP on ekrem is OK: HTTP OK HTTP/1.1 200 OK - 453 bytes in 0.006 seconds [16:46:09] RECOVERY - Backend Squid HTTP on cp1014 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.178 seconds [16:46:59] RECOVERY - Frontend Squid HTTP on cp1011 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.180 seconds [16:47:39] RECOVERY - Frontend Squid HTTP on cp1012 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.179 seconds [16:48:19] RECOVERY - Frontend Squid HTTP on cp1006 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.202 seconds [16:48:59] RECOVERY - Frontend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.180 seconds [16:48:59] RECOVERY - Backend Squid HTTP on cp1016 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.218 seconds [16:48:59] RECOVERY - Frontend Squid HTTP on cp1013 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.221 seconds [16:50:29] RECOVERY - Backend Squid HTTP on cp1009 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.200 seconds [16:50:29] RECOVERY - Frontend Squid HTTP on cp1007 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.219 seconds [16:53:29] RECOVERY - Backend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.178 seconds [16:54:09] RECOVERY - Frontend Squid HTTP on cp1003 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.179 seconds [16:54:58] RECOVERY - Backend Squid HTTP on cp1003 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.161 seconds [16:54:58] RECOVERY - Backend Squid HTTP on cp1015 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.200 seconds [16:54:59] RECOVERY - Frontend Squid HTTP on cp1018 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.179 seconds [16:57:13] New patchset: Dzahn; "add account for Andrew Otto, add to host stat1 per RT 2375" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2268 [16:57:58] RECOVERY - Frontend Squid HTTP on cp1014 is OK: HTTP OK 
HTTP/1.0 200 OK - 27535 bytes in 0.190 seconds [16:57:58] RECOVERY - Frontend Squid HTTP on cp1020 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.220 seconds [16:59:18] RECOVERY - Frontend Squid HTTP on cp1008 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.189 seconds [16:59:59] RECOVERY - Backend Squid HTTP on cp1005 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.200 seconds [17:00:45] New patchset: Dzahn; "add account for Andrew Otto, add to host stat1 per RT 2375 (alphabetical)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2268 [17:00:58] RECOVERY - Frontend Squid HTTP on cp1015 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.219 seconds [17:01:05] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2268 [17:01:38] RECOVERY - Frontend Squid HTTP on cp1010 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.163 seconds [17:01:38] RECOVERY - Frontend Squid HTTP on cp1004 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.219 seconds [17:01:38] RECOVERY - Frontend Squid HTTP on cp1016 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.182 seconds [17:03:28] RECOVERY - Frontend Squid HTTP on cp1005 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.219 seconds [17:03:38] RECOVERY - Backend Squid HTTP on cp1018 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.178 seconds [17:03:38] RECOVERY - Frontend Squid HTTP on cp1009 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.183 seconds [17:04:48] RECOVERY - Backend Squid HTTP on cp1020 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.178 seconds [17:06:18] RECOVERY - Frontend Squid HTTP on cp1017 is OK: HTTP OK HTTP/1.0 200 OK - 27535 bytes in 0.189 seconds [17:07:58] RECOVERY - Backend Squid HTTP on cp1010 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.188 seconds [17:09:08] RECOVERY - Backend Squid HTTP on cp1004 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.186 seconds [17:09:38] RECOVERY - Backend Squid HTTP on cp1013 
is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.193 seconds [17:10:28] RECOVERY - Backend Squid HTTP on cp1011 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.165 seconds [17:12:28] RECOVERY - Backend Squid HTTP on cp1012 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.219 seconds [17:14:58] RECOVERY - Backend Squid HTTP on cp1008 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.187 seconds [17:19:48] PROBLEM - Host db42 is DOWN: PING CRITICAL - Packet loss = 100% [17:21:08] RECOVERY - Backend Squid HTTP on cp1007 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.198 seconds [17:23:54] New patchset: Mark Bergsma; "Make gmetad restart upon config file changes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2269 [17:24:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2269 [17:24:21] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2269 [17:24:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2269 [17:24:58] RECOVERY - Backend Squid HTTP on cp1019 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.199 seconds [17:25:28] RECOVERY - Backend Squid HTTP on cp1006 is OK: HTTP OK HTTP/1.0 200 OK - 27400 bytes in 0.186 seconds [17:26:35] !log updated dns for manutius.mgmt [17:26:36] Logged the message, RobH [17:31:52] New patchset: Mark Bergsma; "Fix remaining file modes in ganglia::web" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2270 [17:32:09] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2270 [17:32:16] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2270 [17:32:17] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2270 [17:34:36] New patchset: Mark Bergsma; "gmetad's init script doesn't support status" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2271 [17:34:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2271 [17:35:01] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2271 [17:35:02] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2271 [17:48:15] New patchset: Mark Bergsma; "Add new squid servers cp1001-1020 to torrus" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2272 [17:48:46] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2272 [17:48:47] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2272 [17:52:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2268 [17:52:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2268 [17:56:12] New patchset: Mark Bergsma; "Fix indentation, modes of ganglia.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2273 [17:58:42] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2273 [17:58:42] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2273 [18:01:20] New patchset: Mark Bergsma; "Merge remote-tracking branch 'origin/production' into test" 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2274 [18:01:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2274 [18:01:54] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2274 [18:01:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2274 [18:03:52] New patchset: Mark Bergsma; "Revert "Merge remote-tracking branch 'origin/production' into test"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2275 [18:04:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2275 [18:22:35] RECOVERY - Host db41 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [18:32:35] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [18:33:53] New patchset: Mark Bergsma; "Revert "Merge remote-tracking branch 'origin/production' into test"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2276 [18:34:21] Change abandoned: Mark Bergsma; "bad revert" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2275 [18:34:47] New review: Mark Bergsma; "THIS REVERT MAY NEED TO BE REVERTED ON THE NEXT test -> production MERGE!" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2276 [18:34:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2276 [18:41:55] PROBLEM - Host db41 is DOWN: PING CRITICAL - Packet loss = 100% [19:11:05] RobH: can you put together a list of which of our db servers are no longer under warranty? [19:11:23] !log manutius installed and ready for use [19:11:25] Logged the message, RobH [19:11:44] binasher: yes, do you mind dropping a ticket in core ops and just assigning it to me? 
[19:11:50] sure [19:11:56] cool, thanks [19:12:08] I can pull our racktables data, and any I am missing I can have Dell pull for us [19:13:04] New patchset: Lcarr; "Changed xmit_hash_policy to a variable so that we can set it to any policy desired" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2277 [19:13:23] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2277 [19:18:29] New patchset: Bhartshorne; "add nagios to iptables, enable nagios to talk to swift via nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2278 [19:18:50] New patchset: Asher; "upgrading dbs 13,18,25,33" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2279 [19:19:08] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2278 [19:19:08] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2278 [19:19:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2279 [19:20:49] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2279 [19:20:50] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2279 [19:22:04] RECOVERY - RAID on ms-fe2 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [19:22:44] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Puppet has not run in the last 10 hours [19:22:44] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [19:22:44] PROBLEM - Puppet freshness on mw65 is CRITICAL: Puppet has not run in the last 10 hours [19:23:39] binasher, you might want to check db40... had a period of nearly 100% cpu with a lot of cpu wait [19:23:45] 120203 15:17:00 InnoDB: Warning: cannot find a free slot for an undo log. 
Do you have too [19:23:46] InnoDB: many active transactions running concurrently? [19:23:59] yay [19:24:20] Think it only lasted half an hour or so, but some users did notice problems [19:25:04] RECOVERY - DPKG on ms-fe2 is OK: All packages OK [19:25:44] RECOVERY - Disk space on ms-fe2 is OK: DISK OK [19:30:34] RECOVERY - Puppet freshness on ms-fe1 is OK: puppet ran at Fri Feb 3 19:30:10 UTC 2012 [19:30:57] oooh, db40 melting? [19:30:59] interesting [19:31:09] admission control for the win, I guess [19:31:38] enabling purge thread in the new build might help [19:31:59] do we clean up stuff nowadays? [19:32:18] it is at 1.6T now [19:32:34] RECOVERY - RAID on ms-fe1 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [19:32:47] occasionally, tim ran deletes last month, but it isn't scheduled [19:32:59] could have also been battery recharge cycle [19:33:01] or something like that [19:33:31] is admission control configuration documented anywhere? [19:33:47] hm :) [19:33:59] definitely in our internal release notes [19:34:07] our documentation is way better than what oracle provides! [19:35:19] anyway, once it is in the build, you need to add two columns [19:35:24] to mysql.user [19:35:29] I wonder if scripts do that [19:36:18] I wonder what was the melt [19:37:04] RECOVERY - DPKG on ms-be1 is OK: All packages OK [19:37:13] could be also some massive parser cache invalidation or something like that [19:37:17] oh well, we won't know [19:37:54] RECOVERY - DPKG on ms-fe1 is OK: All packages OK [19:38:34] RECOVERY - Disk space on ms-fe1 is OK: DISK OK [19:39:43] http://ganglia.wikimedia.org/latest/?r=4hr&cs=&ce=&m=&c=MySQL+pmtpa&h=db40.pmtpa.wmnet&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [19:40:02] New patchset: Pyoungmeister; "having /etc/sudoers include sudoers.d will help a lot." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/2280 [19:40:15] there's a little dip in innodb_buffer_pool_pages_dirty that lines up with what seems to be the stall [19:40:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2280 [19:40:45] what is that drop in memory chart?! [19:40:47] :-) [19:41:04] RECOVERY - RAID on ms-be1 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [19:41:08] ah [19:41:11] look at wio jump [19:41:18] maybe just disk stall during a flush aka bbu recycle at the time [19:41:40] so not internal mysql problem [19:41:43] yeh, bbu recycle [19:42:03] otoh [19:42:17] or gmond stalled too and what's in ganglia is crap :) [19:42:47] well, wio hike says something went wrong with i/o, lets see arcconf logs [19:43:39] * domas frowns at arcconf, it shows same event date for all events [19:43:40] :) [19:43:48] There also was a nagios raid warning of "Unable to read output", but that's gone since [19:44:01] but yeah, looks like all the events were about FSA_EM_ENHANCED_BATTERY_CHANGE [19:44:36] can we force writeback during bbu recycle? [19:44:53] I thought we already forced it that way [19:45:16] it is in 'wb' now, not 'wbb' [19:45:41] though sometimes that fails and controller needs poking [19:46:06] root@db40:/a/sqldata-cache# arcconf setcache 1 logicaldrive 0 wb [19:46:06] Controllers found: 1 [19:46:06] WARNING: Power failure without battery/ZMM support will lead to data loss. [19:46:06] Do you wish to continue? [19:46:06] Press y, then ENTER to continue or press ENTER to abort: y [19:46:07] Command completed successfully. [19:46:09] root@db40:/a/sqldata-cache# [19:46:11] heh [19:46:16] * domas did wb->wbb->wb cycle [19:46:32] uh oh [19:46:32] [19722887.120901] Northbridge Error, node 1 [19:46:33] [19722887.168871] L3 Cache Tag error. [19:46:47] long ago though [19:46:48] :) [19:47:13] poor server [19:47:33] odd, see a sharp drop in disk space? 
[19:48:12] hehe, it blew up in memory a bit at that time [19:48:32] it would be so nice [19:48:35] if ganglia graphs [19:48:37] had proper metrics [19:48:55] New patchset: Bhartshorne; "changed URL for nagios check to ms-fe to something that exists" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2281 [19:49:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2281 [19:49:27] like per-second [19:49:36] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2281 [19:49:37] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2281 [19:49:46] binasher: odd, pretty much every load metric went up afterwards A LOT [19:49:54] binasher: including queries/writes/bytes sent/etc [19:50:00] so it was a cache purge event more like it [19:50:01] :) [19:50:11] with resulted in lots of writes/replaces/purging/etc [19:50:17] binasher: did I catch some of your puppet changes? [19:50:25] mysql.pp and site.pp [19:50:27] binasher: nothing server related :( [19:50:36] maplebed: fine if you did [19:50:41] ok, I'll merge them. [19:50:52] binasher: I'd think someone invalidated all the cache somewhere :) [19:51:15] yeah !r2279 [19:51:26] merged. [19:53:27] min exptime on one table is 2012-01-05 21:23:31, so i guess that's when the purger last ran [19:53:47] Change abandoned: Pyoungmeister; "this would conflict with the /etc/sudoers file for all apaches :/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2280 [19:55:59] domas: can you see if any admission control documentation is shareable? just checked and there's nothing with the lp source (apart from sparse comments in the source) [19:56:21] untested patches are way better than undocumented ones [19:56:54] :-) [19:58:56] domas: is it just enough to comment out $wgObjectCaches['mysql-multiwrite'] to disable db40? 
(will it still use memcache in that case?) [19:59:18] it may or it may not [19:59:21] anyway, it was workload fuckup [20:00:45] yup but i want to update to a newer build [20:00:57] with purge thread and ac (in case we ever use ac) [20:01:17] hehe [20:01:17] sure [20:01:26] you can comment it out for maintenance actions [20:01:29] "may or may not" == scariest thing ever [20:01:31] assuming my fix wasn't reverted [20:01:38] well, I had to patch it few times for it to be used [20:03:32] oh the lulz [20:03:58] what were you patches? [20:08:57] binasher: I was setting 30d expiration on memcached where expiration was larger than 30d [20:09:04] binasher: was committed into wmf branch [20:09:07] not sure if it went anywhere [20:09:14] and already got lost once in an upgrade [20:09:15] :) [20:09:27] oh [20:09:39] i thought you meant mysql patches! [20:11:38] New patchset: Bhartshorne; "allow nagios (and everybody else) to get into swift stuff to check usage for check_disk" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2282 [20:11:57] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2282 [20:11:57] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2282 [20:11:58] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2282 [20:14:22] our memcached's have to evict enough that i don't know if high ttl's would matter [20:19:50] binasher: do you know what the emails about nagios not being in the sudoers file for the dbs are about? they only started 2 days ago. [20:19:52] or maybe yesterday. [20:20:07] maplebed: notpeter is working on fixing [20:20:12] ok. [20:20:13] tnx. 
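Note on the 30d memcached expiration mentioned at [20:08:57]: memcached treats an expiration value larger than 30 days (60*60*24*30 seconds) as an absolute Unix timestamp rather than a relative offset, so a clamp like the one described is a common safeguard. Whether that was the motivation for the wmf-branch patch is an assumption. A minimal sketch follows; the function and constant names are hypothetical and are not taken from the MediaWiki code.

```python
# Hypothetical illustration; not the actual wmf-branch patch.
# memcached interprets expiration values above this as absolute timestamps.
THIRTY_DAYS = 60 * 60 * 24 * 30


def clamp_ttl(ttl_seconds: int) -> int:
    """Cap a relative TTL at 30 days so it stays a relative offset."""
    if ttl_seconds > THIRTY_DAYS:
        return THIRTY_DAYS
    return ttl_seconds
```

As noted at [20:14:22], this only matters if entries survive eviction long enough for the TTL to be reached.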
[20:20:42] RECOVERY - Disk space on ms-be1 is OK: DISK OK [20:27:02] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 251 seconds [20:32:56] !log upgraded mysql on dbs 13,18,25,33 [20:32:57] Logged the message, Master [20:35:11] New patchset: Pyoungmeister; "redoing the way we handle sudo so that /etc/sudoers.d/ is always used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2283 [20:35:29] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2283 [20:36:31] binasher, maplebed: would you be willing to look at that patch? [20:36:36] sure. [20:36:43] thanks! [20:36:47] oh god. [20:36:51] :) [20:37:07] all I really want [20:37:18] is for every box to look at /etc/sudoers.d [20:37:57] although that dir doesn't exist by default on pre-lucid [20:38:27] # Note that there must be at least one file in the sudoers.d directory (this [20:38:27] # one will do), and all files in this directory should be mode 0440. [20:38:37] +1 permissions. [20:38:46] you're setting it to 755 but it should be 440. [20:39:05] (which, paradoxically, is ok in puppet even for dirs; 440 -> 550 when it's a directory. magic!) [20:39:10] maplebed: 0440 = files, not the directory [20:39:18] is 755 on lucid hosts where it's autocreated [20:39:25] yeah, need at least 111 [20:39:30] er [20:39:31] 100 [20:39:45] maplebed: wait [20:39:46] what? [20:39:49] alright... [20:40:00] notpeter: the documentation implies that including the directory will break sudo if there isn't a file in there, even if it's only one that contains comments [20:40:06] I will accept your voodoo black magic [20:40:08] it's so you can say recurse => true on the directory. [20:40:30] and recurse => true + mode => 440 == mode => 550 for directories and 440 for files within the directories. [20:40:33] maplebed: oh, that makes sense [20:40:47] binasher: harumph. ok. [20:41:02] +1 placeholder file. [20:41:07] yeah, I shall add one.
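Note on the "440 -> 550" remark at [20:39:05] and [20:40:30]: Puppet's file type sets the search (execute) bit on a directory wherever the read bit is set in the numeric mode. A minimal sketch of that mapping, with a hypothetical helper name; this illustrates the rule and is not Puppet's own code.

```python
def directory_mode(file_mode: int) -> int:
    """Return the mode a directory ends up with for a given numeric file mode.

    Wherever a read bit (r) is set for user, group, or other, the matching
    search/execute bit (x) is added. Hypothetical helper for illustration.
    """
    read_bits = file_mode & 0o444      # keep only the r bits
    return file_mode | (read_bits >> 2)  # shift each r bit onto its x bit


print(oct(directory_mode(0o440)))
```

With mode 440 and recurse => true, files under sudoers.d stay 0440 while the directory itself becomes 0550, which matches what was described above.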
[20:41:51] binasher: if you think the sudoers.d should be 755 instead of 550, I'd be cool with that, but if all the files inside are 440, may as well make the dir 550 too/ [20:41:52] ? [20:42:43] 550 seems suitable [20:43:53] what matters is that 440 for the files is mandatory [20:50:02] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [20:57:22] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 2325 seconds [20:58:06] New patchset: Pyoungmeister; "redoing the way we handle sudo so that /etc/sudoers.d/ is always used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2283 [20:58:11] alrighty [20:58:15] that should be better [20:58:24] another review? [20:58:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2283 [20:58:38] I'm just a little anxious, as this would break deploy on apaches if I fuck itup... [20:59:16] you've got a trailing tab after the placeholder content... [20:59:27] (thanks for making it BRIGHT RED, gerrit...) [21:00:22] notpeter: I would add recurse => true to the directory so if other people create sudoers.d/files and screw up the permissions, this will smack them into place (because I think sudoers complains if any of the files are the wrong permissions) [21:01:31] maplebed: will recurse => true only fix the perms? or will it do other stuff too? [21:01:58] I think it only forces perms, but it might also do other stuff (delete files? dunno.) [21:02:24] hrm, ok, lemme look into that [21:04:08] notpeter: you have some duplicate entries in sudoers.default and sudoers.$specific-thing. [21:04:35] moving from each service providing its own complete sudoers file to each only providing a portion means you should also remove redundant stuff from the portions [21:04:52] (eg root being allowed to run stuff) [21:05:23] oh, you did get root. [21:06:08] hm. [21:06:13] maybe you did get them all. 
[21:06:36] eventually we should pull the nagios one out (into either default or its own .nagios file) [21:06:41] but I don't think you need to for this switch. [21:07:29] maplebed: I think I de-duped [21:07:34] but I could have missed something [21:08:43] yeah, I think you did get them all. [21:08:45] so nevermind. [21:09:26] oh hey. [21:09:31] maplebed: Everything puppet did not put into the directory gets removed [21:09:39] so, this actually sounds good for sudoers [21:09:44] but might break things currently. [21:09:52] notpeter: you're deleting nrpe_fundraising but it's still referenced (at least in my checkout) in misc/fundraising.pp. [21:10:01] oh! [21:10:04] fudge [21:10:10] I think it's useless now [21:10:14] I'm going to check with jeff [21:10:22] you know, I'm tempted to say use recurse and break stuff that people have thrown in to sudoers.d/ by hand, but that's a little evil. [21:10:28] but will put us in a better place. [21:10:38] because we don't want shit appearing in that directory by hand. [21:10:42] maplebed: yeah, from a sec perspective it's *much* better [21:10:51] maybe not on a friday. [21:11:07] but I'd say send mail and do it next week. [21:11:26] not on a friday, [21:11:31] that would just be mean [21:15:25] yeah, I'm going to set recurse => true, and not merge and send out an email [21:19:23] jeff confirmed that that nrpe_fundraising thing is junk now [21:21:51] New patchset: Pyoungmeister; "redoing the way we handle sudo so that /etc/sudoers.d/ is always used" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2283 [21:22:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2283 [21:26:21] New patchset: Bhartshorne; "adding http monitoring to swift proxy servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2284 [21:26:39] New review: gerrit2; "Lint check passed."
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2284 [21:28:36] New review: Bhartshorne; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/2284 [21:28:38] Change merged: Bhartshorne; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2284 [21:40:40] *** SECURITY information for db24.pmtpa.wmnet *** is officially old [21:43:47] Jeff_Green: that shit is fucked up [21:44:00] hella [21:44:03] I can run the command as the nagios user from the command line [21:44:03] fine. I live hacked it to shut it up [21:44:16] apergos: ++ [21:44:18] now the puppet fix can take its time [21:44:39] so look at the error message again [21:44:43] apergos: what did you do? [21:44:47] what user isn't in sudoers? [21:45:08] oh [21:45:09] man [21:45:12] :-P [21:45:13] that's so dumb [21:45:51] and there ya have it [21:46:53] apergos: you should comment on my email about our sudoers setup :) [21:47:32] maybe [21:48:03] i.e. maybe later... (I admit to being off the clock for awhile here, it's almost midnight) [21:48:22] I suppose that's a good reason [21:48:40] i guess check-raid.py could also say "if ur root, don't sudo" [21:49:13] it could but [21:49:24] we could have a root line, it's sorta standard [21:49:36] binasher: or you could just brazenly merge in my patch =P [21:49:38] whatevs [21:49:58] merge your own patch! [21:50:19] NO U [21:50:35] maybe "be bold" isn't quite the right saying here [21:50:36] er, wait, this is a public channel isn't it... [21:50:42] delete delete delete [21:51:07] apergos: yeah, 4:50 on a friday is not Bold O'Clock [21:51:15] heh [21:59:53] New patchset: Asher; "more db upgrades" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2285 [22:00:12] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2285 [22:00:47] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2285 [22:00:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2285 [22:11:20] New patchset: Asher; "wrong array" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2286 [22:11:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/2286 [22:11:47] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2286 [22:11:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2286 [22:19:46] * Jeff_Green heading out. good weekends all. [22:21:40] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [22:23:17] !log rebooted db35, db39 [22:23:18] Logged the message, Master [22:29:51] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/2277 [22:29:52] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/2277 [22:39:53] !log db35 had an iblogfile size inconsistent with other s5 hosts. 
streaming a hotbacking of db1034 to db35 [22:39:55] Logged the message, Master [22:55:54] PROBLEM - Swift HTTP on copper is CRITICAL: Connection refused [22:57:04] PROBLEM - Swift HTTP on owa3 is CRITICAL: Connection refused [22:58:14] PROBLEM - Swift HTTP on magnesium is CRITICAL: Connection refused [23:00:54] PROBLEM - Swift HTTP on owa2 is CRITICAL: Connection refused [23:01:34] PROBLEM - Swift HTTP on zinc is CRITICAL: Connection refused [23:05:34] PROBLEM - mysqld processes on db35 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [23:06:43] !log upgrading percona-toolkit to 2.02 on all coredbs [23:06:45] Logged the message, Master [23:07:02] !log applying loopback filter on cr2-eqiad [23:07:03] Logged the message, Mistress of the network gear. [23:16:36] testing and timing pt-online-schema change to dewiki.revision on db1021 (not in rotation) [23:16:51] er [23:16:53] !log testing and timing pt-online-schema change to dewiki.revision on db1021 (not in rotation) [23:16:55] Logged the message, Master [23:21:14] !log timing the same operation as a normal alter on db1005. expect db lag to get backed up by hours [23:21:15] Logged the message, Master [23:50:18] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 1772 seconds [23:59:51] !log applying loopback filter on cr1-sdtpa [23:59:52] Logged the message, Mistress of the network gear.