[00:07:42] Labs, Labs-Sprint-107, Patch-For-Review: Build proper monitoring for making sure that processes that need to run only once on one labstore only are running only once on one labstore only - https://phabricator.wikimedia.org/T106590#1497383 (yuvipanda) ok, so we have ensure => stopped set for nfs-exports...
[02:42:54] Labs, operations, wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1497611 (Krenair) NEW
[02:47:23] Labs, operations, wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1497625 (Krenair)
[06:45:41] PROBLEM - Puppet failure on tools-shadow is CRITICAL 30.00% of data above the critical threshold [0.0]
[06:48:47] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 40.00% of data above the critical threshold [0.0]
[06:59:39] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1497768 (MoritzMuehlenhoff) If the apt pinning for backports is configured in a way that backports is only selected on a case-by-case basis by running e.g. "apt-get install -t jessie-backports install f...
[07:00:25] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1497769 (yuvipanda) So is there a way for us to do -t with puppet?
[07:08:55] Change on wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Saqib was created, changed by Saqib link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Saqib edit summary: Created page with "{{Tools Access Request |Justification=hosting Wikimedia affiliate website |Completed=false |User Name=Saqib }}"
[07:25:47] RECOVERY - Puppet failure on tools-shadow is OK Less than 1.00% above the threshold [0.0]
[07:28:47] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[07:37:09] Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1497818 (yuvipanda) This is already pretty expensive, not sure if we need it for static.
[07:37:31] Labs: Have checkpoint check for Wikitech availability - https://phabricator.wikimedia.org/T107457#1497819 (yuvipanda) To note - we want it in catchpoint only if we care about bringing up actual numbers - icinga for just alerting.
[07:39:02] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1497824 (yuvipanda)
[08:03:34] Labs, operations: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1497836 (faidon) The difference between the two labstores is because of a change in upstream d-i during the jessie RC cycle that disabled backports by default, cf. [[ https://bugs.debian.org/764982 | De...
[08:04:13] Labs, operations: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1497837 (faidon) p:Triage>Normal
[08:22:11] (PS1) Sitic: Add FlaggedRevs support [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/228212 (https://phabricator.wikimedia.org/T101456)
[08:52:30] Gerrit-Patch-Uploader: Delte https://gerrit.wikimedia.org/r/186635 - https://phabricator.wikimedia.org/T107558#1497909 (Physikerwelt) NEW
[08:59:43] (PS2) Sitic: Add FlaggedRevs support [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/228212 (https://phabricator.wikimedia.org/T101456)
[09:01:40] (CR) Sitic: [C: 2 V: 2] Add FlaggedRevs support [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/228212 (https://phabricator.wikimedia.org/T101456) (owner: Sitic)
[09:47:13] Does anyone know how a user comes to have null user_registration? According to the replicas, there are 490766 such users on English Wikipedia.
[09:51:33] GoldenRing: see https://phabricator.wikimedia.org/T24097 they are users registered around 2005
[09:54:15] That's somewhere around 2% of all registered users.
[09:54:30] sitic: Ah, okay, thanks.
[10:56:40] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL 40.00% of data above the critical threshold [0.0]
[12:17:29] Labs, Tool-Labs, Patch-For-Review: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#1498208 (GoldenRing) I've updated the wiki with some guidelines on how to get python3 working using this. The biggest problem seems to be that uWSGI mountpoints don't work with Python 3, but that'...
[12:50:14] Is there anyone about who could run an EXPLAIN query for me against the enwiki replica? I don't have the necessary database rights.
[12:53:57] GoldenRing, paste it on phabricator
[12:54:45] Do you mean as a new ticket? Sorry, still a bit new to phabricator.
[12:55:02] there is a paste functionality
[12:55:22] https://phabricator.wikimedia.org/paste/create/
[12:56:01] you can paste it anywhere, really, I am trying to make that functionality more known
[12:56:31] normally https://tools.wmflabs.org/tools-info/optimizer.py should work, but it's having a permission denied issue
[12:57:21] valhallasw`cloud, do you know where the code is?
[12:57:53] jynus: /data/project/tools-info on tool labs. it tries to connect to a database, but the password doesn't work (replica.my.cnf broken? not sure)
[13:00:05] Tool-Labs-tools-Other: tools-info optimizer.py: "Access denied for user 's51772'@'10.68.18.46' (using password: YES)" - https://phabricator.wikimedia.org/T107571#1498247 (valhallasw) NEW
[13:00:06] ^
[13:03:08] the user exists, so probably that is the case
[13:03:47] if it is not urgent, I will let yuvipanda investigate, as he changed the script recently
[13:03:58] will be more efficient
[13:04:09] Huh
[13:04:16] That shouldn't have affected it at all
[13:04:24] That's troubling
[13:04:25] strange, then
[13:04:34] I'll take a look in a bit
[13:04:41] go back to sleep
[13:04:53] sorry I pinged you
[13:05:07] GoldenRing, about that query?
[13:05:07] No I'm up early :)
[13:05:55] Labs, operations: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1498261 (coren) >>! In T107507#1497836, @faidon wrote: > The difference between the two labstores is because of a change in upstream d-i during the jessie RC cycle that disabled backports by d...
[13:06:27] Just a moment, sorry, found other problems with the query :-)
[13:09:33] jynus: https://phabricator.wikimedia.org/P1722
[13:10:07] The subquery runs very quickly, so I'm a bit surprised that the join to it is so slow.
[13:20:05] GoldenRing: you can use logging_userindex instead of logging
[13:28:28] GoldenRing, here is your query vs. your query on the original db vs. your query using logging_userindex: https://phabricator.wikimedia.org/P1722
[13:28:38] thanks, eranroz for the suggestion
[13:29:09] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1498283 (coren) >>! In T102478#1494562, @faidon wrote: > Sounds good to me, although I'd really prefer it if we reinstalled labstore10...
[13:31:10] Labs, operations, Labs-Sprint-102, Labs-Sprint-103, and 4 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1498287 (coren) NEW a:coren
[13:41:50] Brilliant, thanks.
[13:45:58] Labs, operations: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1498324 (scfc) (I'd prefer the pinning approach because it's easier to read and I intend to use it in the future for #Tool-Labs execution instances, but:) If backports is only needed for `pyt...
[13:49:48] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1409 is CRITICAL 60.00% of data above the critical threshold [0.0]
[13:53:21] Labs, Analytics, Labs-Infrastructure: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1498336 (Ottomata) NEW a:Ottomata
[13:56:28] jynus, eranroz: Please have a look at the follow-up question on the same phabricator ticket. A query trying to achieve a similar thing, which once again takes much longer than I'd expect.
[13:58:33] Is it trying to group every row in the join and then apply the limit?
[14:03:12] Labs, Analytics, Labs-Infrastructure, Patch-For-Review: Set up cron job on labstore to rsync data from stat* boxes into labs. - https://phabricator.wikimedia.org/T107576#1498355 (Ottomata) That patch ^ will allow labstore1003 to rsync from stat*::srv/... No idea where to put a cronjob on labstore1...
[14:03:29] GoldenRing: try to use log_action instead of log_type
[14:04:51] Perfect. Thanks very much.
[14:24:47] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1409 is OK Less than 1.00% above the threshold [0.0]
[14:31:13] Labs, operations, Labs-Sprint-107, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1498398 (Andrew) I think I've found a new issue with the .19 kernel, so investigating further today.
[14:49:52] eranroz: Is it generally safe to assume that user.user_id and user.user_registration will produce the same ordering??
[14:58:27] GoldenRing: That'll hold true for the most part, but not always - especially on very old users.
[14:59:22] Okay, I'm only interested in fairly recent users. I'm seeing some very odd results, but I'm just trying to sanity check a bit before I yell too loud...
[15:05:25] See eg http://tools.wmflabs.org/movestats/ - change the user limit to 1000 and some large, very regular gaps appear in the chart. That can't be coincidence, surely?
[15:07:22] See https://phabricator.wikimedia.org/P1722 (near the bottom) for the query being used, with s/log_type/log_action/g.
[15:34:50] (PS1) Sitic: Don't text-transform uppercase navbar buttons [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/228286
[15:34:52] (PS1) Sitic: Fix connectionerror handler [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/228287
[15:35:11] (CR) Sitic: [C: 2 V: 2] Don't text-transform uppercase navbar buttons [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/228286 (owner: Sitic)
[15:35:18] (CR) Sitic: [C: 2 V: 2] Fix connectionerror handler [labs/tools/crosswatch] - https://gerrit.wikimedia.org/r/228287 (owner: Sitic)
[15:44:20] Tool-Labs-tools-Other: Geohack should be mobile friendly - https://phabricator.wikimedia.org/T103409#1498602 (Jdlrobson) Sample url: https://en.m.wikivoyage.org/w/index.php?title=Special:MapSources&params=52%C2%B0+31%E2%80%B2+N%2C+13%C2%B0+24%E2%80%B2+30%E2%80%B3+E%2C+scale%3D50000 @Thogoiter thanks for the...
[16:43:34] Which is more pythonish? "not foo in bar" or "foo not in bar"?
[16:44:34] Coren: foo not in bar
[16:45:59] coren pep8 alerts for win! they'll recommend the latter automatically
[16:46:09] Syntastic is the nice vim plugin
[16:47:22] pep8 doesn't say anything about it for me?
[16:47:52] maybe an old pep8
[16:47:52] https://stackoverflow.com/questions/24671925/pep8-e713-test-for-membership-should-be-not-in
[16:49:42] valhallasw`cloud: syntastic alerts me
[16:50:31] dat screenshot
[16:50:32] https://raw.githubusercontent.com/scrooloose/syntastic/master/_assets/screenshot_1.png
[16:51:05] heh
[16:53:06] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1498905 (yuvipanda)
[16:55:41] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1498922 (yuvipanda)
[16:58:11] on which db are userdatabases stored?
[16:58:19] tools.commons-delinquent@tools-bastion-02:~$ mysql -h s1.labsdb -e "show databases;"
[16:58:19] ERROR 1045 (28000): Access denied for user 'steinsplitter'@'10.68.16.44' (using password: NO)
[17:00:19] wtf. no pass
[17:01:07] well
[17:01:13] that's because you haven't given it a password?
[17:01:35] YuviPanda: rather because of a broken .my.cnf, I think, but they already left
[17:01:55] (user 'steinsplitter' should be 'u1234' or 's1234')
[17:01:55] valhallasw`cloud: (using password: NO)
[17:02:02] you need to pass -read-defaults
[17:02:03] that too
[17:14:31] is it known that labvirt1005 just flapped?
[17:18:01] (yes, andrew is working on it, testing some kernel updates)
[17:18:02] from -operations: yes, it's known, don't worry
[17:18:06] :)
[17:31:14] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1499012 (yuvipanda)
[17:31:30] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1469900 (yuvipanda)
[17:31:53] valhallasw`cloud: ^ I'm starting to write out evaluation criteria, if you're interested in looking
[17:42:21] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1499077 (yuvipanda) ooooh, that's great! I don't feel like I've the necessary qualifications to own this task. Do you think you can do that, @jcrespo? I guess it'll primarily be just figuring ou...
[17:43:12] * Reedy gives YuviPanda a tape measure
[17:43:34] IT BETTER BE IN INCHES I LIVE IN THE USA NOW
[17:44:06] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1499091 (jcrespo) a:jcrespo I lacked the specific goal. The original description was a bit vague. Now I can own this.
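Note on the 16:58 to 17:02 exchange above: "(using password: NO)" means the mysql client sent no password at all, which fits the suggestion that no usable defaults file was read. The sketch below shows one way to check a credentials file and build a command line that points mysql at it explicitly. It is only an illustration: the function name is hypothetical, and it assumes the file has the usual [client] section with a password entry. The mysql client's --defaults-file option is real, and it has to come first among the options.

```python
import configparser


def mysql_command(host, defaults_file, query):
    """Return a mysql argv that takes user and password from defaults_file.

    Hypothetical helper for illustration; assumes a [client] section
    containing a password entry.
    """
    # allow_no_value: my.cnf files may contain bare flags such as "quick".
    # interpolation=None: a "%" inside a password must not be interpreted.
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    # read() skips files it cannot open and returns the list it did parse.
    if not parser.read(defaults_file):
        raise FileNotFoundError(defaults_file)
    if "client" not in parser or "password" not in parser["client"]:
        raise ValueError(f"{defaults_file} has no [client] password")
    # --defaults-file must be the first option on the mysql command line.
    return ["mysql", f"--defaults-file={defaults_file}", "-h", host, "-e", query]
```

Called as mysql_command("s1.labsdb", "/path/to/replica.my.cnf", "show databases;"), it returns a list that can be handed to subprocess.run; the path here is a placeholder, not a location confirmed by the log.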
[17:44:10] I've got one in inches
[17:44:30] jynus is going to own the shit out of that task
[17:44:41] YuviPanda, but let me for now set it as low
[17:45:01] jynus: if you don't think we need to expand in the next month or two, sure :)
[17:45:21] it is not that I think, we cannot
[17:45:39] but we badly need some machines there or limits
[17:45:43] jynus: because of lack of budget or lack of ops time?
[17:45:49] both
[17:45:52] jynus: I can poke around and see if we do have budget for it
[17:46:01] please do
[17:46:10] jynus: can you put that on the ticket? :)
[17:46:34] the "we badly need some machines"?
[17:46:46] yeah :D
[17:47:16] We can do a fundraising banner of YuviPanda looking sad
[17:50:19] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1499120 (jcrespo) There are no need for specific metrics to know that current hardware for labsdb machines is lacking. There is no basic HA and from time to time we suffer OOMs. There are 2 optio...
[17:50:43] sorry, I dropped
[17:51:29] YuviPanda: don't forget to also evaluate SGE according to those criteria
[17:51:46] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1499139 (yuvipanda) So if we can find a way to add another machine of similar specs, that'll help improve things as well, right? As for load balancing - we currently are doing vague DNS based lo...
[17:54:14] phabricator, y u no has strikethrough
[17:54:49] valhallasw`cloud: good point
[17:55:23] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1499158 (yuvipanda)
[17:55:31] ah, ~~ ~~
[17:56:18] jynus: can you help spec out an ideal labsdb machine so we can look at how much it's gonna cost?
[17:56:26] I also left some questions about load balancing
[17:57:51] valhallasw`cloud: any obvious missing bits there, in the 'should at least'?
[17:58:35] Labs, Labs-Sprint-107: Evaluate a 'cluster solution' for use on Tool Labs - https://phabricator.wikimedia.org/T106475#1499162 (valhallasw) We need a gridengine replacement to schedule and manage arbitrary user applications in a flexible, user friendly way. Potential candidates with substantial adoption, d...
[17:58:36] ^
[17:59:24] valhallasw`cloud: do you have a google account I can share a spreadsheet with you with?
[17:59:47] valhallasw`cloud: actually, https://docs.google.com/spreadsheets/d/1YkVsd8Y5wBn9fvwVQmp9Sf8K9DZCqmyJ-ew-PAOb4R4/edit?usp=sharing
[17:59:49] it's not complete
[17:59:52] but that's the full matrix
[18:00:42] it's a copy pasta of the earlier thing we used for etcd/consul/zk
[18:02:18] ah nice
[18:15:14] Labs, operations, Labs-Sprint-107, ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1499192 (Andrew) Update: 3.19 kernels don't crash when I suspend/resume, but the VMs don't come up properly; their clocks are seriously broken such that a sim...
[18:21:35] Labs, Database: Measure capacity and utilization of labsdb*** boxes - https://phabricator.wikimedia.org/T107070#1499200 (jcrespo) > Are you thinking of doing a different form of load balancing, or just doing what we do now but with HA for user databases? Both. I haven't thought about a particular proposal,...
[19:09:59] matanya: is encoding01 doing active work right now? Could we delete it and recreate it later? (I don’t have it in for that instance in particular, just looking to free up some space on that node)
[19:11:39] SMalyshev: Is it ok if I shut down and move wdq-beta? It would suffer 10-20 minutes of downtime; if people will notice that then I’ll let it be.
[19:18:04] andrewbogott: only wdq-beta? ok
[19:18:30] yep, only that one. I’ll move it right now if that’s ok
[19:19:32] andrewbogott: ok
[19:19:44] please ping me when it's back up
[19:39:34] !log video migrating encoding01 to a new virt host.
[19:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Video/SAL, dummy
[19:40:03] Labs, Tool-Labs: Fix 'unknown's in shinken - https://phabricator.wikimedia.org/T99072#1499398 (valhallasw) a:valhallasw>None
[19:40:23] YuviPanda: so what do we want to do with the toollabs workboard? kill it?
[19:40:40] valhallasw`cloud: not sure....
[19:41:15] paravoi.d said he found it useful, but he's in the middle of fighting outages
[19:41:25] Labs, Tool-Labs: GeoHack failing in Trusty because it can not allocate enough memory? - https://phabricator.wikimedia.org/T107253#1499399 (valhallasw) p:Triage>Normal
[19:41:46] YuviPanda: it's mainly that it's sort of duplicate with the labs workboard
[19:41:52] I agree
[19:42:00] Labs, Tool-Labs, Patch-For-Review: Rewrite the meta_p table populating code to python and have it run on a cron - https://phabricator.wikimedia.org/T107094#1499402 (valhallasw) p:Triage>Low
[19:42:24] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1499404 (valhallasw) p:Triage>Normal
[19:43:35] Labs, Tool-Labs, Database: Lost connection to MySQL server during query - https://phabricator.wikimedia.org/T107029#1499408 (valhallasw) p:Triage>High Can you provide some more information? How are you querying, and what is the exact error you're getting?
[19:43:45] Labs, Tool-Labs: Add ability to choose Node.js version in webservice2 - https://phabricator.wikimedia.org/T106961#1499413 (valhallasw) p:Triage>Low
[19:44:02] Labs, Tool-Labs, User-Bd808-Test: Create template PHP application for use on Tool Labs based on Slim, Twig and Wikimedia libraries - https://phabricator.wikimedia.org/T90092#1499418 (valhallasw) p:Triage>Low
[19:45:01] Labs, Tool-Labs, Wikimedia-Labs-General: Create "next page" function for tools.wmflabs.org - https://phabricator.wikimedia.org/T70390#1499423 (valhallasw)
[19:45:03] Labs, Tool-Labs: tools.wmflabs.org landing page should not dump all tool accounts - https://phabricator.wikimedia.org/T104917#1499424 (valhallasw)
[19:45:12] Labs, Tool-Labs: tools.wmflabs.org landing page should not dump all tool accounts - https://phabricator.wikimedia.org/T104917#1499426 (valhallasw) p:Triage>Low
[19:57:23] Labs, Tool-Labs, Mail: Mails from tools are being marked as 'spam' by gmail - https://phabricator.wikimedia.org/T104871#1499435 (01tonythomas) One reason would be lot of failed deliveries arising from the labs domain making gmail consider *.wmflabs.org as spam.
[20:05:58] SMalyshev: sorry, this is taking forever; I’ll ping you when things finish.
[20:06:42] valhallasw`cloud: think we can mark https://phabricator.wikimedia.org/T107052 as resolved?
[20:06:55] andrewbogott: I'm wondering whether we should upstream it
[20:07:11] and maybe have a shinken check for it?
[20:07:42] probably good ideas, although the situation that produced it is (I hope) unusual.
[20:07:46] not sure. large pacct files could be a generic sign of trouble
[20:08:27] the problem is also that I have no clue how to reproduce the issue, which makes reporting the bug harder
[20:09:40] Hm, why is the s51362__erwin85 DB missing on tools
[20:12:05] of course the disk space checks caught it eventually
[20:13:18] Labs, Tool-Labs: create diamond reporter & shinken alert for /var/log/account/pacct and/or pacct.1 size - https://phabricator.wikimedia.org/T107617#1499460 (valhallasw) NEW a:Andrew
[20:15:47] Labs, Tool-Labs: create diamond reporter & shinken alert for /var/log/account/pacct and/or pacct.1 size - https://phabricator.wikimedia.org/T107617#1499467 (valhallasw) a:Andrew>None
[20:16:16] Labs, Tool-Labs: lighttpd does not correctly close connections (CLOSE_WAIT) - https://phabricator.wikimedia.org/T104799#1499468 (valhallasw) p:Triage>High
[20:16:31] Nemo_bis: did we restart all servers with the issue in the end?
[20:16:35] eh, andrewbogott
[20:16:41] why did I type an n there
[20:18:07] andrewbogott: what's the kernel version on the server?
[20:18:54] nfs server*
[20:19:21] which issue
[20:19:37] Nemo_bis: the NFS issue, but I meant to highlight Andrew. Sorry :(
[20:21:41] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1499473 (valhallasw) I propose to upstream this bug, but given that it's quite a complicated issue, I think it's worth the effort to get the bug report right. I've start...
[20:22:27] ok
[20:28:58] Tool-Labs-tools-Erwin's-tools: relatedchanges.php MySQL errors - https://phabricator.wikimedia.org/T107618#1499483 (Nemo_bis) NEW
[20:30:46] valhallasw`cloud: set your name in etherpad!
[20:44:25] valhallasw`cloud: labstore1002 is 3.19.0-1-amd64
[20:45:18] ok!
[20:45:42] I've filled the bug report mostly, but it would be good if you and Coren could take a look at it and maybe add relevant info I missed
[20:45:51] valhallasw`cloud: I tried to unmount/remount on those servers that didn’t have enormous logfiles.
[20:46:19] I tried remounting on lighttpd-1401, but that didn't seem to work
[20:47:31] now that I think of it, I think we only saw trusty hosts?
[20:47:32] valhallasw`cloud: It's safe to mark it as resolved. I'm 90% sure what the cause was and we know the fix works.
[20:47:32] valhallasw`cloud: a mount -oremount isn't the same as unmounting and remounting, mind you.
[20:47:33] ah, ok
[20:47:34] Coren: if I understand your cause analysis correctly, it's a kernel bug, right?
[20:47:37] Coren: he means the proposed upstream bug. https://etherpad.wikimedia.org/p/T107052
[20:48:55] I don't know if "kernel bug" is right. It's an edge case caused by a weakness in the NFS protocol itself. The kernel could handle it better, probably, but the issue is in the protocol.
[20:49:18] And it can only happen if we do something both rare and fairly cruddy.
[20:50:42] * Coren reads the etherpad
[20:52:00] Yeah, filing upstream is not unreasonable. I'm pretty sure it's going to be closed with an admonition to not restart the server like that, but ynk. :-)
[20:52:42] Maybe. OTOH, 'retrying 10k times per second' pretty much sounds like a bug in the client to me
[20:53:07] Yeah, definitely crappy handling of the edge case.
[20:54:31] Tool-Labs-tools-Erwin's-tools: relatedchanges.php MySQL errors - https://phabricator.wikimedia.org/T107618#1499593 (Nemo_bis) ``` Warning: There were MySQL errors at Fri, 31 Jul 2015 20:53:27 +00002mysql_connect(): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)/data/proj...
[21:11:31] Labs, Labs-Infrastructure: Labs virt capacity expansion - https://phabricator.wikimedia.org/T107624#1499662 (Andrew) NEW
[21:12:31] Tool-Labs-tools-Erwin's-tools: relatedchanges.php MySQL errors - https://phabricator.wikimedia.org/T107618#1499674 (coren) >>! In T107618#1499593, @Nemo_bis wrote: > ``` > [...]_connect(): Can't connect to local MySQL server [...] That can never work, given how labs is structured (there are no local mysql s...
[21:13:05] YuviPanda: please add rmoen and bmansurov to https://wikitech.wikimedia.org/wiki/Nova_Resource:Mobile-smoketests as admin
[21:13:24] that project has an instance with a script that is spamming phabricator empty pastes
[21:14:02] greg-g: I can do it, one moment...
[21:14:10] andrewbogott: thank you kind sir
[21:14:46] greg-g: as admins?
[21:14:55] yes please
[21:15:07] so they can add others in their team, as appropriate
[21:15:15] ok, should be all set.
[21:25:39] Quarry, HTTPS: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627#1499757 (Legoktm) NEW
[21:36:57] Quarry, HTTPS: Quarry should be HTTPS-only - https://phabricator.wikimedia.org/T107627#1499816 (Legoktm) In the meantime... https://github.com/EFForg/https-everywhere/pull/2353
[22:20:51] Labs, Tool-Labs, Labs-Sprint-107: tools bastion accounting logs super noisy, filling /var - https://phabricator.wikimedia.org/T107052#1499971 (scfc) Regarding "Remounting shares does not solve the issue", @Andrew wrote above that remounting //did// solve the problem. Shouldn't it be possible to replic...
[23:25:39] PROBLEM - Puppet staleness on tools-webgrid-lighttpd-1407 is CRITICAL 30.00% of data above the critical threshold [43200.0]
[23:36:46] !log services deleting htmldump because I need the space and Gabriel says it’s ok
[23:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Services/SAL, dummy
[23:41:29] SMalyshev: all done with the move; sorry it took forever.