[05:25:30] hoi .. bad gateway for Reasonator4 [05:27:58] that’s my fault, I’m working on it [05:28:03] well, maybe [05:28:05] ok [05:32:18] GerardM: better? [05:32:51] Thanks :) [05:35:20] PROBLEM - Puppet failure on tools-exec-1405 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:35:20] PROBLEM - Puppet failure on tools-webgrid-generic-1401 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:36:04] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1407 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:36:06] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:36:19] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1413 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:36:47] PROBLEM - Puppet failure on tools-webgrid-generic-1404 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [05:36:59] PROBLEM - Puppet failure on tools-master is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [05:37:11] PROBLEM - Puppet failure on tools-exec-1202 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [0.0] [05:37:11] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:37:25] PROBLEM - Puppet failure on tools-exec-1402 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [0.0] [05:37:59] PROBLEM - Puppet failure on tools-exec-1403 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [05:38:28] PROBLEM - Puppet failure on tools-packages is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:38:29] PROBLEM - Puppet failure on tools-mailrelay-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:38:32] PROBLEM - Puppet failure on tools-proxy-01 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [0.0] [05:40:08] PROBLEM - Puppet failure on tools-redis-02 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:40:43] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1405 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [0.0] [05:40:55] PROBLEM - Puppet failure on tools-webgrid-generic-1403 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:41:09] PROBLEM - Puppet failure on tools-exec-1404 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [05:49:38] PROBLEM - Puppet staleness on tools-exec-cyberbot is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [05:50:26] RECOVERY - Puppet failure on tools-exec-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [05:51:44] 6Labs, 10Labs-Infrastructure, 6operations, 5Patch-For-Review: rename holmium to labservices1002 - https://phabricator.wikimedia.org/T106303#1824353 (10Andrew) labservices1001 is now the primary host; it should be safe to reboot/reinstall/rename holmium at any time. 
[06:03:33] RECOVERY - Puppet failure on tools-proxy-01 is OK: OK: Less than 1.00% above the threshold [0.0] [06:05:15] RECOVERY - Puppet failure on tools-redis-02 is OK: OK: Less than 1.00% above the threshold [0.0] [06:05:15] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0] [06:06:10] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1407 is OK: OK: Less than 1.00% above the threshold [0.0] [06:06:10] RECOVERY - Puppet failure on tools-exec-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [06:06:11] RECOVERY - Puppet failure on tools-exec-1404 is OK: OK: Less than 1.00% above the threshold [0.0] [06:06:24] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1413 is OK: OK: Less than 1.00% above the threshold [0.0] [06:06:52] RECOVERY - Puppet failure on tools-master is OK: OK: Less than 1.00% above the threshold [0.0] [06:07:09] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [06:08:29] RECOVERY - Puppet failure on tools-mailrelay-02 is OK: OK: Less than 1.00% above the threshold [0.0] [06:08:30] RECOVERY - Puppet failure on tools-packages is OK: OK: Less than 1.00% above the threshold [0.0] [06:10:53] RECOVERY - Puppet failure on tools-webgrid-generic-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [06:11:53] RECOVERY - Puppet failure on tools-webgrid-generic-1404 is OK: OK: Less than 1.00% above the threshold [0.0] [06:12:04] RECOVERY - Puppet failure on tools-exec-1202 is OK: OK: Less than 1.00% above the threshold [0.0] [06:12:28] RECOVERY - Puppet failure on tools-exec-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [06:15:46] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1405 is OK: OK: Less than 1.00% above the threshold [0.0] [06:17:58] RECOVERY - Puppet failure on tools-exec-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [06:18:30] PROBLEM - Puppet staleness on tools-exec-gift is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [08:48:24] 6Labs, 10MediaWiki-extensions-OpenStackManager, 10MediaWiki-General-or-Unknown, 10wikitech.wikimedia.org: DB error when trying to create an account on wikitech - https://phabricator.wikimedia.org/T117553#1824509 (10Beta16) I have already an account on wikitech, but every time I try to access the site retur... [11:39:38] 10Tool-Labs-tools-Other, 6Community-Tech, 7Tracking: Improving Magnus' tools (tracking) - https://phabricator.wikimedia.org/T115537#1824775 (10Ricordisamoa) >>! In T115537#1813261, @Magnus wrote: > Finally, someone realized just how much shoddy code I deployed! :-) Your tools are constantly praised and alwa... [12:16:29] 6Labs, 10Tool-Labs, 6operations, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1824834 (10jcrespo) And this is why your DBA cannot take a single day of vacations. [12:17:45] 6Labs, 10Tool-Labs, 6operations, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1823353 (10jcrespo) Waiting for replication lag to go back to 0 to close this ticket. [12:40:47] 10Tool-Labs-tools-Other, 6Community-Tech, 7Tracking: Improving Magnus' tools (tracking) - https://phabricator.wikimedia.org/T115537#1824859 (10Magnus) >>! In T115537#1824775, @Ricordisamoa wrote: >>>! In T115537#1813261, @Magnus wrote: >> Finally, someone realized just how much shoddy code I deployed! :-) >... 
[12:49:45] 6Labs, 10Tool-Labs, 6operations, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1824876 (10Luke081515) @jcrespo Replag at dewiki and commons is still growing :-( [12:51:52] 6Labs, 10Tool-Labs, 6operations, 7Database: Replication broken on labsdb1002. - https://phabricator.wikimedia.org/T119315#1824878 (10jcrespo) Sadly, after a server crashes, it empties its caches, so I would expect it to be growing for a while. I am monitoring it, just in case there is more issues- will kee... [12:59:13] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1824883 (10jcrespo) a:3jcrespo [13:00:18] 6Labs, 10Labs-Infrastructure, 10Wikidata: Labs DB replication of Wikidata has stopped - https://phabricator.wikimedia.org/T119382#1824886 (10Magnus) 3NEW [13:01:15] 6Labs, 10Labs-Infrastructure, 10Wikidata: Labs DB replication of Wikidata has stopped - https://phabricator.wikimedia.org/T119382#1824894 (10jcrespo) [13:01:17] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1824895 (10jcrespo) [14:31:26] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Dorgold was created, changed by Dorgold link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Dorgold edit summary: Created page with "{{Tools Access Request |Justification=phd research project on art historical articles on wikipedia |Completed=false |User Name=Dorgold }}" [14:54:47] 6Labs, 10Tool-Labs, 5Patch-For-Review: webgrid nodes have very limited swap (500MB) - https://phabricator.wikimedia.org/T118419#1825092 (10coren) 5Open>3Resolved a:3coren This was done over Nov 20, with all web nodes now sporting a shiny new /tmp and labs_lvm::swap [15:06:08] 6Labs, 10Tool-Labs: Puppet disabled on tools-exec-cyberbot, tools-exec-gift - https://phabricator.wikimedia.org/T119389#1825099 (10valhallasw) 3NEW [15:07:00] 6Labs, 10Tool-Labs: Puppet disabled on tools-exec-cyberbot, tools-exec-gift - https://phabricator.wikimedia.org/T119389#1825110 (10coren) Ah, yes - their puppet was not restarted at the end of Friday's switchover. Fixing shortly. [15:08:11] 6Labs, 10Tool-Labs: tools-service-01: Error: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install tools-manifest' returned 100 - https://phabricator.wikimedia.org/T119390#1825111 (10valhallasw) 3NEW [15:17:59] 6Labs, 10Labs-Team-Backlog, 3ToolLabs-Goals-Q4: Labs NFSv4/idmapd mess - https://phabricator.wikimedia.org/T87870#1825168 (10coren) NFS was restarted on labstore1001 and its memory map confirms that the LDAP shim is in place: ``` 7f4c9b5f4000-7f4c9b5f5000 r-xp 00000000 09:00 53870988 /usr/... [15:22:14] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Dorgold was modified, changed by Tim Landscheidt link https://wikitech.wikimedia.org/w/index.php?diff=207899 edit summary: [15:23:10] commonswiki_p has replag O_O [15:23:51] 6Labs, 10Labs-Infrastructure: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T119393#1825195 (10Magnus) 3NEW [15:23:53] 6Labs, 10wikitech.wikimedia.org: "Edit with form" missing on a Tools access request page - https://phabricator.wikimedia.org/T118136#1825205 (10scfc) Another instance: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Dorgold. 
[15:24:03] Steinsplitter: https://phabricator.wikimedia.org/T119315 [15:32:07] Thanks, Glaisher :) [15:36:02] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1825257 (10jcrespo) So, the recovery thread was still running, effectively setting the hosts in read-only mode. labsdb1002 has finally started its replication proc... [15:53:26] 6Labs, 6operations: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1825326 (10Andrew) 3NEW a:3Andrew [15:54:12] 6Labs, 10Labs-Infrastructure: Investigate decommissioning labcontrol2001 - https://phabricator.wikimedia.org/T118591#1825334 (10Andrew) a:3Andrew [16:22:47] 6Labs, 10Tool-Labs: Migrate some tools nodes away from labvirt1002, it's getting full - https://phabricator.wikimedia.org/T119399#1825400 (10Krenair) [16:31:42] 6Labs, 10wikitech.wikimedia.org: "Edit with form" missing on a Tools access request page - https://phabricator.wikimedia.org/T118136#1825418 (10Krenair) Nope, both show "Edit with form". [16:40:34] 6Labs: Paramiko broken in sink_nova_ldap - https://phabricator.wikimedia.org/T119408#1825470 (10Krenair) [16:52:52] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1825528 (10jcrespo) Corruption is still present, I am trying to repair the tables manually. For that, I need to put the labsdb servers down for some time. This may take... [16:56:04] 6Labs, 7Database: s2/s51072__dwl_p is stuck - https://phabricator.wikimedia.org/T119305#1825543 (10Giftpflanze) 5Open>3Resolved a:3Giftpflanze [17:00:49] 6Labs, 10MediaWiki-Vagrant, 5Patch-For-Review, 15User-bd808: MediaWiki-Vagrant fails due to dpkg post-installation script of hhvm - https://phabricator.wikimedia.org/T115450#1825584 (10bd808) [17:01:04] 6Labs, 10Labs-Infrastructure, 10MediaWiki-Vagrant, 5Patch-For-Review, 15User-bd808: Forwarded ports are bound only to localhost - https://phabricator.wikimedia.org/T115139#1825598 (10bd808) [17:01:06] 6Labs, 10Labs-Infrastructure, 10MediaWiki-Vagrant, 5Patch-For-Review, 15User-bd808: sudo prompts for my password running `vagrant up` in labs::mediawiki_vagrant role - https://phabricator.wikimedia.org/T115080#1825600 (10bd808) [17:20:31] 10Tool-Labs-tools-Global-user-contributions, 6Stewards-and-global-tools: Global user contributions doesn't work - https://phabricator.wikimedia.org/T119414#1825671 (10Vituzzu) 3NEW [17:20:45] 6Labs, 10Labs-Infrastructure: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T119393#1825682 (10coren) I'm on it. This is almost certainly related to T87870 work although LDAP has not been disabled yet. [17:21:07] 6Labs, 10Labs-Infrastructure: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T119393#1825684 (10coren) a:3coren [17:24:19] 10Tool-Labs-tools-Global-user-contributions, 6Stewards-and-global-tools: Global user contributions doesn't work - https://phabricator.wikimedia.org/T119414#1825692 (10Glaisher) caused by {T119315} [17:33:22] YuviPanda: 16:00 UTC is over, isn't it? Or do you mean tomorrow 16:00 UTC? [17:38:49] 6Labs, 10Labs-Infrastructure: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T119393#1825763 (10coren) This has been fixed by a hotfix undoing https://gerrit.wikimedia.org/r/#/c/254176/ More investigation is needed.
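Earlier in this stretch jcrespo notes that the crash-recovery thread was still running, effectively holding the hosts read-only, and that labsdb1002 had only just restarted replication. That state can be read straight off the server; a minimal sketch, assuming pymysql, an admin credentials file, and a reachable labsdb host (the host name and credentials path are illustrative, not from the log):

```python
import os
import pymysql

# Ask a labsdb host whether replication is running and how far behind it is.
conn = pymysql.connect(host="labsdb1002",   # illustrative host name
                       read_default_file=os.path.expanduser("~/.my.cnf"))
with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("SHOW SLAVE STATUS")
    for row in cur.fetchall():               # one row per replication channel
        print(row["Slave_IO_Running"],
              row["Slave_SQL_Running"],
              row["Seconds_Behind_Master"])
conn.close()
```

This is essentially what the "pager grep Seconds" snippet quoted later in the log does interactively: filter SHOW SLAVE STATUS output down to the Seconds_Behind_Master line.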
[17:38:51] Luke081515: coren would know but I think he's still debugging something [17:39:10] 6Labs, 10Labs-Infrastructure: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T119393#1825767 (10coren) p:5Unbreak!>3High (It's working now, but needs investigation) [17:39:21] when i connect to database and lose connection i cannot reconnect [17:39:32] i have to exit the script and restart it [17:50:58] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1825876 (10jcrespo) I think I may have fixed the corruption on labsdb1003, but sadly some data was lost in the process. I cannot guarantee the accuracy of its con... [17:56:30] 10Tool-Labs-tools-Other, 6Community-Tech, 7Tracking: Improving Magnus' tools (tracking) - https://phabricator.wikimedia.org/T115537#1825912 (10Ricordisamoa) >>! In T115537#1824859, @Magnus wrote: >> I'm very glad you're being timely on merging pull requests, but I was referring to the time they take to hit p... [18:03:08] 6Labs: Add a note to wdq-mm error page about query.wikidata.org - https://phabricator.wikimedia.org/T119123#1825934 (10yuvipanda) p:5Triage>3Normal [18:03:24] 6Labs, 5Patch-For-Review: Convert all ldap globals into hiera variables instead - https://phabricator.wikimedia.org/T101447#1825937 (10yuvipanda) p:5Triage>3Low [18:03:35] 6Labs: Write up a report about kubecon to the Ops team - https://phabricator.wikimedia.org/T118757#1825938 (10yuvipanda) p:5Triage>3Low [18:04:02] 6Labs, 6operations: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#1825939 (10yuvipanda) p:5Triage>3Normal [18:04:42] 6Labs, 10Tool-Labs, 7Documentation: Wikimedia Labs system admin (sysadmin) documentation sucks - https://phabricator.wikimedia.org/T57946#1825942 (10coren) a:5coren>3None [18:04:50] gifti: it's probably the ongoing labsdb issues [18:04:51] 6Labs, 10Tool-Labs, 7Puppet: Document our GridEngine set up - https://phabricator.wikimedia.org/T88733#1825945 (10coren) a:5coren>3None [18:05:00] 6Labs, 10Tool-Labs: Document how to turn shadow into master - https://phabricator.wikimedia.org/T91133#1825947 (10coren) a:5coren>3None [18:05:11] 6Labs, 10Tool-Labs: Document / get rid of jobkill.pl - https://phabricator.wikimedia.org/T91233#1825948 (10coren) a:5coren>3None [18:11:34] 6Labs, 10Labs-Infrastructure: Labstore primary needs more frequent cleanup of snapshots - https://phabricator.wikimedia.org/T109176#1825982 (10coren) 5Open>3Resolved It is; I've been keeping an eye on it and all is good. 
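gifti's problem above (a script that cannot recover once its database connection drops) usually just needs a reconnect-and-retry wrapper rather than restarting the whole script. A minimal sketch, assuming the pymysql driver and the usual Tool Labs conventions of ~/replica.my.cnf credentials and *.labsdb host aliases; the function name, retry count and delay are illustrative:

```python
import os
import time
import pymysql

def run_query(sql, host="dewiki.labsdb", db="dewiki_p", retries=5, delay=10):
    """Run a query, reconnecting if the server has gone away."""
    conn = None
    for attempt in range(retries):
        try:
            if conn is None:
                conn = pymysql.connect(
                    host=host, db=db,
                    read_default_file=os.path.expanduser("~/replica.my.cnf"))
            conn.ping(reconnect=True)          # re-open a dropped connection
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        except pymysql.err.OperationalError:
            # Server went away (e.g. a labsdb restart): drop the handle and retry.
            conn = None
            time.sleep(delay)
    raise RuntimeError("query failed after %d attempts" % retries)
```

During an outage like the one discussed here the retries will still fail until the server is back, but the script no longer has to be killed and restarted by hand.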
[18:12:26] 6Labs, 10Tool-Labs, 7Documentation: add basic expectations management to docs - https://phabricator.wikimedia.org/T56701#1825988 (10coren) a:5coren>3None [18:13:22] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1825994 (10Luke081515) [18:13:24] 10Tool-Labs-tools-Global-user-contributions, 6Stewards-and-global-tools: Global user contributions doesn't work - https://phabricator.wikimedia.org/T119414#1825993 (10Luke081515) [18:23:40] 6Labs, 6operations: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1826037 (10Andrew) a:5Andrew>3yuvipanda [18:36:03] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826080 (10jcrespo) Short term fix done on labsdb1002, too: ``` MariaDB LABS localhost (none) > pager grep Seconds PAGER set to 'grep Seconds' MariaDB LABS local... [18:39:37] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826094 (10Multichill) @jcrespo: Do you update dns entries before you kill a database server? We use names like "wikidatawiki.labsdb" and that way the impact is r... [18:41:14] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826097 (10jcrespo) @Multichill No, that was done on purpose. The servers were killed because of an OOM, as we have 2/3 servers down, redirecting to a single server... [18:44:23] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826103 (10Betacommand) I know this is more work, but I think the community would prefer that you do a fresh import of the data. Having an unknown volume of data... [18:46:23] * YuviPanda waves [18:46:54] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826118 (10jcrespo) @Betacommand I hear you, but doing a full import will take down the servers for 5 days. Having only 1 server available and up to date is not a... [18:47:46] So, 2 days ago, labsdb1002 and labsdb1003 crashed [18:47:58] origin is a kernel OOM [18:48:30] jynus: Just so you could have a nice challenge to work on [18:48:43] Good luck with patching it up [18:48:56] not too strange given that users sometimes do huge queries [18:49:11] not a huge issue, server restarts automatically [18:49:18] I was wondering. Once you have everything patched up, can't you just create a new VM and do the database import there? Or do you have a resource problem here? [18:49:35] resources are the source of everything [18:49:56] if we had enough resources, databases would be running in a more consistent mode [18:50:14] but we do not have the hardware to afford that [18:50:23] Define resources in this sense. Money? Disk space? IOPS? CPU? Memory? etc [18:50:33] also, it happened on my only day of holidays [18:50:39] and there is no other DBA [18:50:56] I read that jynus. 
I hope you found out after your only day [18:50:59] time primarily I would venture (we only have one DBA right now) [18:51:00] I am also in charge of production, research/analytics and other miscellaneous databases [18:51:24] in terms of hardware, number of servers [18:51:29] and IOPS [18:51:34] So, the WDQ database crashed, query.wikidata.org went belly up and the labs databases crashed [18:51:37] I see a pattern here [18:52:03] we are at 100% cpu almost all the time [18:52:42] memory can solve partially some iops, but not for our whole dataset [18:52:55] That's scary. Are you all SSD for the database servers? [18:53:06] additionally, in order to fit all data in a single server, we need to compress using tokudb [18:53:14] and use parallel replication [18:53:41] which was probably the cause of replication being stopped but not providing feedback of why [18:54:07] I upgraded the database to the latest version and ran the check script in order to make it work [18:54:10] is that why the server restarted and replication didn't [18:54:12] ? [18:54:17] yep [18:54:26] strange thing - a bug [18:54:45] multichill, SSDs? for labs? that is a dream [18:54:52] jynus: In the past replication crashed because the production db's were changed, but downstream wasn't notified [18:54:53] we do not have those in production either [18:55:07] All flash is getting cheaper every day [18:55:35] multichill, I agree, but there are 150 servers to maintain... 2-3 terabytes each [18:55:44] and how many in labs, 8? [18:55:55] All flash only for the servers that need it (db's) [18:56:06] 150 are the dbs [18:56:41] Production servers do use SSD. I'm pretty sure of that because it took quite a bit of work to put those in! [18:56:57] Or at least the newest caches in Haarlem [18:57:01] databases do not use SSDs [18:57:13] most of them are 3-year-old servers [18:57:36] I think we are buying our first SSD database in a few weeks [18:57:42] multichill: you're thinking of the varnish caches. also I'd like to let jynus finish what he was saying before we go down hardware land :) [18:57:48] yes [18:57:52] so, current state [18:58:10] Yeah, good luck [18:58:13] databases are replicating successfully, slowly getting back to sync [18:58:44] in order to make this work, I had to ignore some errors when updating [18:58:55] so there may be some inconsistencies [18:59:15] important ones should be reported on the ticket or new tickets [18:59:30] support tables will be fixed automatically every week [18:59:47] others will be checked manually, or, as someone suggested, reimported [18:59:55] so jynus are these inconsistencies in user dbs? or inconsistencies in the replica data? [18:59:57] but that has to wait until we are up to date [19:00:16] user dbs can never guarantee consistency [19:00:24] because they use myisam mostly [19:00:29] and that is not crash-safe [19:00:30] ah ofc [19:00:32] right [19:00:43] recommending innodb to fix that [19:00:53] but that is up to each user [19:00:59] so, the idea is [19:01:11] the blocking thing has been done [19:01:22] we now have to wait for replication to catch up [19:01:39] and will reevaluate when that happens [19:01:50] also inviting users to report additional issues [19:02:05] does that make sense? [19:02:10] yeah [19:02:13] so basically: [19:02:15] if someone asks about replicas being late [19:02:19] 1. replag is catching up slowly [19:02:21] or missing data [19:02:22] yes [19:02:28] 2. some data might be missing / corrupt and should be reported [19:02:50] 3. 
when replication has caught up we can do consistency checks [19:02:57] is that all fair to say? [19:03:15] yes [19:03:20] that is a good summary [19:03:37] after 2, we will reevaluate [19:03:48] and schedule downtime appropriately [19:03:53] ok! [19:03:54] for proper maintenance [19:04:03] do you have vague ETAs for replication catchup? [19:04:03] if needed [19:04:09] like, hours? days? [19:04:13] I was waiting to give you that [19:04:22] one sec so I can calculate [19:05:27] while I am looking at this [19:05:36] awesome [19:05:54] I thought of changing the dns, I preferred not to, because I didn't want to bring down the one we had up [19:06:06] especially when 2/3 servers are problematic [19:06:51] yeah +1 [19:07:54] 2.7 seconds per second [19:07:57] which means [19:08:34] less than 10 hours [19:09:07] can I copy your summary and paste it on the ticket? [19:09:37] jynus: please do! [19:09:39] jynus: I agree in this case. Would have sucked if you brought down the remaining server [19:15:13] andrewbogott: chasemp I think new instance creation might be broken? [19:15:24] mdc-puppetmaster (ori's new instance) is doa and unreachable [19:15:42] ^ andrewbogott maybe related to dns issues? [19:16:01] doesn't respond to ping nor ssh [19:16:11] but console says it's at login prompt and has run puppet [19:16:26] hm [19:16:31] what labvirt did it end up on? [19:16:36] dunno [19:16:43] horizon's broken (won't let me switch projects) [19:16:47] let me ssh to labcontrol and find out [19:16:56] YuviPanda: I’ve run a dozen tests this morning, has always worked for me... [19:17:04] I'm stuck in the middle of something atm but I'll try in a few [19:17:09] can you look at mdc-puppetmaster? [19:17:35] what project? [19:18:50] andrewbogott: mdc [19:19:26] YuviPanda: security groups [19:20:30] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826180 (10jcrespo) So, to give you a general update: 1. After an upgrade and configuration change to work around a bug in the storage engine/parallel replication,... [19:21:28] YuviPanda: I added 22 but ori might want to fill in the other empty security groups [19:21:36] ah [19:21:38] ok [19:21:44] andrewbogott: why didn't they get added by default... [19:23:10] YuviPanda: dunno, sometimes ‘default’ is empty in new projects but never when I’m watching [19:23:17] heh [19:32:20] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826209 (10jcrespo) p:5Unbreak!>3High [20:26:36] jynus: Replag is growing at the moment [20:27:39] Luke081515, let me check [20:28:11] (03PS10) 10ArthurPSmith: Added a Wikidata-based "chart of the nuclides" under /nuclides [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 [20:28:21] when I reload http://tools.wmflabs.org/betacommand-dev/cgi-bin/replag, the Lag (Time) gets bigger [20:28:47] labsdb1003 is ok [20:28:54] labsdb1002 is not [20:28:55] Luke081515: what wiki are you looking at? 
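jynus's figures above ("2.7 seconds per second", "less than 10 hours") follow from a simple rate calculation: if a replica applies 2.7 seconds' worth of events per wall-clock second, it gains 1.7 seconds on the master every second. A back-of-the-envelope sketch; the interpretation of the rate and the starting lag in the example are assumptions, not values taken from the log:

```python
def catchup_eta_hours(lag_hours, apply_rate=2.7):
    """Hours until a replica catches up, assuming it applies `apply_rate`
    seconds of replication events per wall-clock second, so the net gain
    is (apply_rate - 1) seconds of lag removed per second."""
    return lag_hours / (apply_rate - 1.0)

# With roughly 17 hours of accumulated lag, 2.7 s/s works out to about 10 hours:
print(round(catchup_eta_hours(17), 1))  # 10.0
```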
[20:29:04] dewiki, and commons [20:29:15] those are labsdb1002 [20:29:19] but others with replag > 1 day grow too [20:29:44] Luke081515: you need to take that lag value with an understanding of how it's calculated [20:29:55] no, labsdb1003 is down to 22 hours already [20:30:14] jynus: ok, but originally this lag was getting smaller; two hours ago, there was a replag bigger than 2 days [20:30:38] yes, as I said, labsdb1002 has issues [20:31:09] hm, ok [20:31:17] it takes a bit to confirm them and know why [20:31:30] (03CR) 10ArthurPSmith: "Ok - some significant changes in this version:" [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [20:32:10] it crashed again, this time due to [20:32:22] Table './p50380g50803__mp/mentee_mentor' is marked as crashed and should be repaired [20:32:37] Luke081515: that tool uses the standard dns records for each database, and shows time since the most recent action on the wiki. On major wikis the difference is not noticeable, but on smaller wikis "lag" might be 12+ hours if no one has done anything on wiki, regardless of the actual database state [20:32:59] yes, that is why I have almost finished an alternative [20:33:07] which will be more accurate [20:33:11] betacommand: Yes, but dewiki and commons are not small wikis ;) [20:33:56] jynus: however I have seen the lag counter from the database be just as screwed up at times. [20:34:23] So I really don't know of a good way of doing it foolproof. [20:34:27] no, it is a new thing [20:34:36] based on production changes [20:34:44] will send more info soon [20:35:05] (03PS11) 10ArthurPSmith: Added a Wikidata-based "chart of the nuclides" under /nuclides [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 [20:36:10] (03CR) 10ArthurPSmith: "Fixed some pep8 complaints - also removed debug mode." [labs/tools/ptable] - 10https://gerrit.wikimedia.org/r/245591 (owner: 10ArthurPSmith) [20:36:41] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826434 (10jcrespo) p:5High>3Unbreak! Labsdb1002 crashed again, creating new corruption on some user tables: ``` 151123 20:02:14 [ERROR] mysqld: Table './p50... [20:40:36] jynus: Good work, replag shrinks again ;) [20:40:52] not so sure about that [20:41:05] wanna double check [20:47:49] jynus> !log restarting labsdb1002, hopefully for the last time [20:49:21] YuviPanda, puppetmaster::self is broken :( Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find resource 'Exec[compile puppet.conf]' for relationship on 'Class[Puppetmaster::Ssl]' on node puppet-test02.maps-team.eqiad.wmflabs [20:50:10] or is it barfing at something else? [20:50:16] * YuviPanda checks [20:52:35] Luke081515, now it should work every time [20:53:33] I am also going to make sure the recovery happens automatically, without me intervening [20:55:46] jynus: Thank you for your work [20:55:56] (I have to go now) [20:56:04] Luke081515, thank you for the heads up! [21:08:20] 6Labs, 10Tool-Labs, 6operations, 7Database: labsdb1002 and labsdb1003 crashed, affecting replication - https://phabricator.wikimedia.org/T119315#1826523 (10jcrespo) After fixing this issue (and setting things up so that if databases crash, they can recover automatically), I will enforce stricter per-user query limit... 
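The "marked as crashed" error quoted above is the MyISAM failure mode jynus has in mind when he recommends InnoDB for user tables: MyISAM is not crash-safe, so every server crash can leave tables needing repair. A sketch of what a tool owner could run against their own user database, using the table from the error message only as an example; the host alias and credentials file are assumptions:

```python
import os
import pymysql

conn = pymysql.connect(host="tools.labsdb",          # assumed host for the user database
                       db="p50380g50803__mp",        # affected user database (example)
                       read_default_file=os.path.expanduser("~/replica.my.cnf"))
with conn.cursor() as cur:
    # Repair the crashed MyISAM table so it is readable again ...
    cur.execute("REPAIR TABLE mentee_mentor")
    # ... then convert it to InnoDB, which recovers on its own after a crash.
    cur.execute("ALTER TABLE mentee_mentor ENGINE=InnoDB")
conn.close()
```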
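Betacommand's point above about how the replag tool computes its numbers (time since the most recent action on the wiki, so a quiet wiki can show hours of "lag" with a perfectly healthy replica) is easy to reproduce. A sketch of that style of check, assuming pymysql, the *.labsdb aliases and the recentchanges table in the _p databases; this illustrates the approach, not the tool's actual code:

```python
import os
import pymysql

def seconds_since_last_change(wiki="dewiki"):
    """Approximate lag as the age of the newest recentchanges row.
    Accurate on busy wikis; misleading on quiet ones, as discussed above."""
    conn = pymysql.connect(host=wiki + ".labsdb", db=wiki + "_p",
                           read_default_file=os.path.expanduser("~/replica.my.cnf"))
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT TIMESTAMPDIFF(SECOND, MAX(rc_timestamp), "
                        "UTC_TIMESTAMP()) FROM recentchanges")
            (age,) = cur.fetchone()
        return age
    finally:
        conn.close()
```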
[21:16:20] 6Labs, 5Patch-For-Review: Paramiko broken in sink_nova_ldap - https://phabricator.wikimedia.org/T119408#1826554 (10Andrew) 5Open>3Resolved Replaced with shelling out to ssh :( [21:23:37] 6Labs, 10wikitech.wikimedia.org: "Edit with form" missing on a Tools access request page - https://phabricator.wikimedia.org/T118136#1826580 (10scfc) 5Open>3Resolved a:3scfc Yep, now all do again. *argl* [21:34:56] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:39:49] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 932212 bytes in 3.494 second response time [21:47:27] 6Labs, 10Labs-Infrastructure: Cannot edit my files on Tools Labs via group - https://phabricator.wikimedia.org/T119393#1826649 (10coren) 5Open>3Resolved The underlying cause has been squished with https://gerrit.wikimedia.org/r/#/c/254176/ [22:00:52] 6Labs, 6Discovery, 7Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1826685 (10kaldari) [22:01:02] 6Labs, 6Discovery, 7Elasticsearch: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1557225 (10kaldari) [22:29:04] MaxSem: btw I think the problem is elsewhere... my self-hosted puppetmasters seem ok [22:29:29] YuviPanda, any suggestions on how to investigate? [22:30:55] I can take a quick look [22:31:01] wait [22:31:05] what class are you using? [22:31:12] the one in the wikitech configure page right? [22:31:49] yup [22:32:17] then I check out my revision and it fails even though the class is not applied anywhere [22:32:35] and keeps on failing even after being reset to production master [22:32:44] MaxSem: have you pulled the latest puppet? [22:51:12] MaxSem: I tried running puppet and it just seems to hang... [22:51:19] * YuviPanda has to go off now, will brb in a couple hours [23:57:41] 6Labs, 6Discovery, 7Elasticsearch, 5Patch-For-Review: Replicate production elasticsearch indices to labs - https://phabricator.wikimedia.org/T109715#1827077 (10EBernhardson) nobelium looks reasonably happy with only enwiki and dewiki turned on. disk read/write is around 20MB/s. When we did this before (wi...