[07:03:20] DBA, Patch-For-Review: Import S2,S6,S7,m3 and x1 to dbstore2001 and dbstore2002 - https://phabricator.wikimedia.org/T151552#2889135 (Marostegui) s5 has been compressed. Now compressing s6.
[07:07:25] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2889153 (Marostegui) p:Triage>High
[07:11:32] DBA: db1047 paged low on disk space - https://phabricator.wikimedia.org/T153634#2889161 (Marostegui) Thanks for the input @fgiunchedi! In order to avoid it paging during the holidays period this is what I have done: ``` root@db1047:/var/log# mkdir /srv/tmp/log_bkup root@db1047:/var/log# cp -r atop/ account/...
[07:13:57] DBA: db1047 paged low on disk space - https://phabricator.wikimedia.org/T153634#2889173 (Marostegui) Also cleaned `apt` cache
[07:16:44] DBA: db1047 paged low on disk space - https://phabricator.wikimedia.org/T153634#2889193 (Marostegui) Open>Resolved a:Marostegui
[07:21:55] DBA: Defragment db1015 - https://phabricator.wikimedia.org/T153739#2889198 (Marostegui)
[07:29:51] DBA: Defragment db1015 - https://phabricator.wikimedia.org/T153739#2889222 (Marostegui) The first iteration over the pagelinks pages (smaller than 1G) was done. It gave us back around 25G. I am now going to proceed with the pagelinks tables bigger than 1G which are 50 of them. I believe that will give us bac...
[07:31:10] DBA: Defragment db1015 - https://phabricator.wikimedia.org/T153739#2889225 (Marostegui)
[07:55:51] DBA, Operations, ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2889265 (Marostegui) p:Triage>Normal
[07:59:12] DBA, Operations, ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2889258 (Marostegui) This is correct, please proceed and change the disk ``` Device Present ================ Virtual Drives : 1 Degraded : 1 Offline :...
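Editor's sketch: the 07:29:51 message describes working through the pagelinks tables in two passes, first those smaller than 1G and then the larger ones, and a later message in this log refers to the work as "optimizing pagelinks". The two-pass split could be sketched roughly as below. This is only an illustration under stated assumptions: the `TableSize` type, both helper functions, and every schema name and size in the usage note are hypothetical and do not come from the log; the only MySQL statement relied on is `OPTIMIZE TABLE`.

```python
from typing import Iterable, List, NamedTuple, Tuple

GIB = 1024 ** 3  # the "1G" threshold mentioned in the log, read here as 1 GiB (an assumption)


class TableSize(NamedTuple):
    """Hypothetical record: one table and its on-disk size in bytes."""
    schema: str
    table: str
    size_bytes: int


def split_by_size(
    tables: Iterable[TableSize], threshold_bytes: int = GIB
) -> Tuple[List[TableSize], List[TableSize]]:
    """Return (small, large): tables below the threshold, and the rest."""
    small: List[TableSize] = []
    large: List[TableSize] = []
    for t in tables:
        (small if t.size_bytes < threshold_bytes else large).append(t)
    return small, large


def optimize_statements(tables: Iterable[TableSize]) -> List[str]:
    """Build one OPTIMIZE TABLE statement per table, smallest table first."""
    ordered = sorted(tables, key=lambda t: t.size_bytes)
    return [f"OPTIMIZE TABLE `{t.schema}`.`{t.table}`;" for t in ordered]
```

A possible use, with made-up inputs: `small, large = split_by_size([TableSize("wiki_a", "pagelinks", 300 * 1024**2), TableSize("wiki_b", "pagelinks", 5 * GIB)])`, then print `optimize_statements(small)` for the first pass and `optimize_statements(large)` for the second. The functions only produce statement text, so each batch can be reviewed before anything is run.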
[09:41:19] how long does check_private_data.py normally take now? i recall it taking around 20 minutes or so?
[09:43:37] yeah, sometimes more
[09:43:51] the one on labs took me >1h
[09:43:57] ah ok
[09:44:02] probably because everything was so cold
[09:44:11] i am running it on labsdb1009 (I am going to start playing around with the report script/alert)
[09:44:18] ok
[09:44:45] should we add a --quiet option that only prints errors or something?
[09:45:19] we will see, let me take a look at the output and all that and i will give it a thought
[09:45:54] we can already puppetize a cron that logs to the same file
[09:46:01] once per day
[09:46:11] then see what we do with the output
[09:46:23] yep, that was exactly my idea
[09:54:14] DBA: Pending things in the labs infra - https://phabricator.wikimedia.org/T153058#2889393 (jcrespo)
[09:55:27] DBA: Pending things in the labs infra - https://phabricator.wikimedia.org/T153058#2889394 (Marostegui)
[09:55:34] DBA: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#2889397 (jcrespo)
[10:00:45] DBA: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#2889405 (jcrespo)
[10:02:32] DBA, Labs, Labs-Infrastructure: Create a cronjob/check to run check_private_data data script and report back - https://phabricator.wikimedia.org/T153680#2889408 (jcrespo)
[10:02:35] DBA, Patch-For-Review: Deploy gtid_domain_id flag in our mysql hosts - https://phabricator.wikimedia.org/T149418#2889409 (jcrespo)
[10:02:37] DBA: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#2868122 (jcrespo)
[10:02:40] DBA, Labs: Make watchlist table available as curated foo_p.watchlist_count on labsdb - https://phabricator.wikimedia.org/T59617#2889410 (jcrespo)
[10:02:57] I really need to work on search for quarry
[10:09:37] hmm
[10:09:38] DBA, Labs: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts -
https://phabricator.wikimedia.org/T153743#2889412 (Marostegui)
[10:09:45] SELECT COUNT(*) FROM page WHERE page_namespace=10 AND page_title NOT LIKE '%/%';
[10:09:47] gives me different numbers on labsdb1009 and 1001
[10:09:57] 1009 has 4 less
[10:10:04] which wiki?
[10:10:13] DBA: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#2889424 (Marostegui)
[10:10:40] DBA, Labs: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#2889412 (Marostegui)
[10:10:42] DBA: LabsDB infrastructure pending work - https://phabricator.wikimedia.org/T153058#2868122 (Marostegui)
[10:11:25] yuvipanda, which wiki?
[10:11:36] jynus: enwiki
[10:11:42] (sorry didn't see ping, was looking for other queries)
[10:14:48] yep, db1095 (sanitarium2) also has 4 less (which is consistent with 1009)
[10:15:03] yes, one of the many cases where current labs hosts are wrong
[10:15:29] but new ones match production
[10:15:31] ah
[10:15:37] right, that makes sense.
[10:15:59] yeah, I can see this in others too
[10:16:14] https://quarry.wmflabs.org/query/6751 has 6 extra rows in labsdb1009 than 1001
[10:16:41] also the query, once warmed up takes 1.46 sec vs. 0.36 sec
[10:17:18] yeah!
[10:17:19] sorry
[10:17:21] actually
[10:17:31] 0.36s vs 4.05 sec
[10:17:42] the promised 5x improvement
[10:17:54] I saw improvements on some other queries after a few runs
[10:17:58] nice :)
[10:18:03] (although with a big * due to not having load right now)
[10:18:09] yeah
[10:18:17] hopefully it will maintain that once we add the other shards
[10:18:31] but I expect the new servers to have more throughput, which is more important
[10:18:47] right
[11:34:09] jynus: most of the checks I did seem good to me :D
[11:35:17] \o/
[11:41:21] marostegui: :D
[13:44:00] jynus: marostegui I'm going to deploy the new account creation script tonight. Objections?
[13:44:17] yuvipanda: nop from my side :)
[13:44:32] yuvipanda: is there anything that can break?
[13:44:48] marostegui: new user creation is the only thing. I'll verify that after deploying ofc :)
[13:45:11] :-)
[13:45:34] I will do some stress testing/failover later
[13:45:45] will ping here
[13:46:05] cool! thanks
[13:48:03] marostegui: jynus these won't have access to labsdb1009/10/11 yet ofc because of firewall rules, so shouldn't interfere with your testing
[13:51:29] ah right, thanks for the heads up
[14:13:16] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2889836 (Marostegui) http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c04046303: ``` HP ProLiant Gen8 servers configured with Intel Xeon E5-2400 v2 or E5-2600 v2 or E5-4600 v2 seri...
[14:40:56] DBA, Patch-For-Review: db2034: investigate its crash and reimage - https://phabricator.wikimedia.org/T149553#2889880 (Marostegui) As per the above message, I think we are not suffering that: ``` root@db2034:~# dmidecode --type memory | egrep "Speed|Configured Clock" | grep -v Unkn Speed: 1866 MHz Co...
[16:03:31] there is a repl_records on labsdb1003
[16:03:48] repl_records?
[16:03:53] sorry
[16:03:57] it is on db1069
[16:04:06] I have no idea what that is
[16:04:42] there is an event called repl_records_partition that partitions dynamically repl_records
[16:04:51] :|
[16:04:58] in all the instances?
[16:06:01] in s6 and s7 only, apparently
[16:07:36] CREATED: 2014-09-27 09:08:56
[16:07:36] LAST_ALTERED: 2014-09-27 09:08:56
[16:07:36] LAST_EXECUTED: 2016-12-20 15:09:06
[16:07:48] looks ancient
[16:09:23] I will drop all
[16:09:52] if it is not on s1-s5, it will not do anything
[16:10:00] plus the table it alters does not exist
[16:10:19] but it generates an error every few seconds in the error log
[16:10:42] indeed
[16:25:44] DBA: Defragment db1015 - https://phabricator.wikimedia.org/T153739#2890210 (Marostegui) We got around 50G back from optimizing pagelinks. ``` root@db1015:/srv/sqldata# df -hT /srv/ Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/tank-data xfs 1.6T 1.4T 194G 88% /srv ``` I have...
[16:30:45] DBA, Operations, ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2890221 (RobH) Please note that @papaul is currently away from the datacenter until after the holiday. Any hardware failures will either wait until his return, or will require a smart hands ticket with Cy...
[16:35:34] DBA, Operations, ops-codfw: Degraded RAID on db2011 - https://phabricator.wikimedia.org/T153740#2890256 (Marostegui) Hey @RobH no, no need to worry about it. It can wait. Thanks for keeping an eye on it!
[17:52:16] DBA: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#2890451 (Marostegui)
[17:52:18] DBA: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#2890466 (Marostegui)
[17:52:20] DBA: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#2890468 (Marostegui)
[17:52:23] DBA: Install and reimage dbstore1001 as jessie - https://phabricator.wikimedia.org/T153768#2890471 (Marostegui)
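
Editor's sketch: the exchange between 16:04 and 16:10 concerns a stale scheduled event, repl_records_partition, that was found on some instances and was going to be dropped. The CREATED / LAST_ALTERED / LAST_EXECUTED fields pasted at 16:07:36 match column names in MySQL's `information_schema.EVENTS` table. Assuming rows from that table have already been fetched by some means, the matching `DROP EVENT` statements could be generated roughly as follows. The `EventRow` type, the helper name, and the schema names in the usage note are hypothetical, and the log does not show how the event was actually removed.

```python
from typing import Iterable, List, NamedTuple


class EventRow(NamedTuple):
    """Hypothetical record holding a few columns of information_schema.EVENTS."""
    event_schema: str
    event_name: str
    created: str
    last_executed: str


def drop_event_statements(rows: Iterable[EventRow], event_name: str) -> List[str]:
    """Build one DROP EVENT statement for every row whose name matches exactly."""
    return [
        f"DROP EVENT IF EXISTS `{row.event_schema}`.`{row.event_name}`;"
        for row in rows
        if row.event_name == event_name
    ]
```

With made-up schema names, `drop_event_statements([EventRow("schema_one", "repl_records_partition", "2014-09-27 09:08:56", "2016-12-20 15:09:06")], "repl_records_partition")` yields a single statement for `schema_one`. Keeping generation separate from execution means the list can be checked per instance before anything is dropped.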