[00:50:46] 10DBA, 10Cloud-Services, 10Community-Wikimetrics, 10Icinga, and 2 others: Evaluate future of wmf puppet module "mysql" - https://phabricator.wikimedia.org/T165625#4071489 (10Dzahn) stopped using on in mediawiki_deployment_server role: https://gerrit.wikimedia.org/r/#/c/421197/
[06:20:25] 10DBA, 10Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4071831 (10Marostegui)
[06:22:07] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1016 - https://phabricator.wikimedia.org/T190179#4071834 (10Marostegui) a:03RobH This host is now ready for DC Ops steps. Assigning it to @RobH
[06:23:00] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4071840 (10Marostegui)
[06:24:41] 10DBA, 10Patch-For-Review: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4071841 (10Marostegui)
[06:36:17] 10DBA, 10Patch-For-Review: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4071883 (10Marostegui)
[06:38:17] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad, 10Patch-For-Review: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4071893 (10Marostegui) a:03RobH This host is now ready for DC Ops steps. Assigning it to @RobH
[06:39:19] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4071898 (10Marostegui)
[06:41:38] 10DBA, 10Operations, 10Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#4071905 (10Marostegui)
[06:41:40] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4071906 (10Marostegui)
[06:41:44] 10DBA, 10Operations, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4071903 (10Marostegui) 05Open>03Resolved All the hosts have been replaced. The old hosts are now ready for DC Ops to finish the decommissioned and...
[06:43:45] 10DBA, 10Operations, 10Goal, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3153495 (10Marostegui) All the hosts <=db1050 have now been retired from service and are just pending to be decommissioned by DC Ops - they have their own individual deco...
[07:25:08] https://gerrit.wikimedia.org/r/#/c/421222/
[07:38:29] 10DBA, 10Operations, 10monitoring, 10Patch-For-Review: Followup for TLS MariaDB server roll-out - https://phabricator.wikimedia.org/T157702#4071962 (10jcrespo)
[07:40:00] only x1, s1 and m1 backups left
[07:44:24] 10DBA, 10Operations, 10hardware-requests, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4071966 (10jcrespo)
[07:58:19] \o/
[08:31:05] 10DBA: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#4072055 (10Marostegui)
[08:58:18] 10DBA: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#4072084 (10Marostegui)
[09:10:48] <_joe_> marostegui, jynus since I'm blocked on the hardware goal, I am going to post a strawman proposal for the db-via-etcd project, so we can start reasoning on an actual implementation starting from there
[09:13:02] _joe_: brilliant!
thanks :)
[09:16:17] _joe_: I was going to propose to create a mock structure but not actually being used
[09:16:33] *but not use it yet, I mean
[09:17:08] <_joe_> jynus: let's first see if my strawman convinces you people, then we can surely go in that direction
[09:17:20] I think we will have some discussion between what we need for maintenance vs. what developers want for simplicity
[09:17:51] 10DBA: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#4072111 (10Marostegui)
[09:19:54] 10DBA: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#4072114 (10Marostegui)
[09:23:26] 10DBA: Truncate `updatelog` on all wikis - https://phabricator.wikimedia.org/T174804#4072120 (10Marostegui) a:03Marostegui The last section s3 is being done. I have left it with long sleeps to avoid any possible delays, especially on labs, as we have seen in the past. This will take a few hours to get done.
[09:26:10] I've made a first draft of the backups documentation https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[09:26:37] for now I have just dumped my mind, now I have to make them legible and add links
[09:27:21] I will help out :)
[09:27:21] taking a break, will come back by meeting time
[09:29:18] <_joe_> uhm, do we ever have one db label in multiple sections?
[09:29:36] <_joe_> so say it can happen that 'db1054' is in both s1 and s2?
[09:29:47] yeah
[09:30:05] we have hosts that are in two sections, the recentchanges slaves
[09:30:16] and some vslow, dumps too
[09:30:34] check for instance db1096 in db-eqiad.php
[09:30:41] it is in s5 and s6
[09:31:09] normally not the same instance
[09:31:19] <_joe_> yeah I mean the label
[09:31:19] but it can happen during a split
[09:31:22] <_joe_> so host:port
[09:31:41] <_joe_> so it can happen one host:port combo is in multiple sections
[09:31:44] <_joe_> ok
[09:31:46] it definitely can happen, although for obvious reasons we try not to
[09:37:07] <_joe_> ok
[09:37:23] <_joe_> just trying to avoid bad assumptions :P
[09:37:28] marostegui: another example would be pc1004 that you just did
[10:59:56] 10DBA, 10Operations, 10hardware-requests, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#4072290 (10Marostegui) I talked to @mark and we'll leave this ticket open until they have been fully decommissioned by @RobH and @Cmjohnson
[11:52:43] 10DBA, 10Cloud-Services: Multiple concurrent long running queries from s51434 overloading labsdb1003/labsdb1009 - https://phabricator.wikimedia.org/T133705#4072406 (10Marostegui) I think we can close this for now, we have the query killers working fine on labs hosts (T183983). Moreover, I haven't seen this que...
[12:02:19] 10DBA, 10Cloud-Services: Multiple concurrent long running queries from s51434 overloading labsdb1003/labsdb1009 - https://phabricator.wikimedia.org/T133705#4072447 (10jcrespo) 05Open>03Resolved
[14:23:21] marostegui: are you running truncate on s3?
[14:29:46] I've stopped it, check https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s3&var-role=All&from=1521707350184&to=1521728950184 (especially error rate per second and replication lag)
[14:44:36] jynus: how's that possible? it was truncating tables with around 30 rows or so with a sleep of one minute :|
[14:44:54] it was taking up to 10 seconds to do that sometimes
[14:44:55] I mean, it is probably related, but such tiny tables...
[14:45:02] wow, 10 seconds?
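For context, a minimal sketch of the two per-wiki cleanup approaches weighed in the exchange around this point; the `somewiki` schema name is only a placeholder, `updatelog` is the table from T174804, and the timings in the comments reflect what was observed above:

```sql
-- TRUNCATE is effectively a DROP + CREATE of the table: with a large InnoDB
-- buffer pool the server first has to evict all of the table's pages, which
-- is the contention described below and what made a ~30 row truncate take
-- up to ~10 seconds and cause replication lag on s3.
TRUNCATE TABLE somewiki.updatelog;

-- DELETE just removes the handful of rows and skips the tablespace drop, so
-- it is the cheaper option here; the tables can still be truncated later,
-- one by one, as a low-priority follow-up.
DELETE FROM somewiki.updatelog;
```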
[14:45:04] which is consistent with bad buffer pool behaviour
[14:45:21] I did it manually on the other shards and it was taking around 1 second
[14:45:23] and creating 10 seconds of lag, which is ok as a one-time thing, but not 900 times, I think
[14:45:24] :|
[14:45:31] totally
[14:45:33] my advice
[14:45:39] I am just surprised about its bad performance
[14:45:40] run delete
[14:45:43] of such a small table
[14:45:44] yeah
[14:45:48] that's probably better
[14:45:51] and we will truncate with time, individually
[14:45:58] as a low-hanging task
[14:46:08] yeah
[14:46:22] sorry, I was seeing bad performance, and the first thing I thought was to stop all maintenance
[14:46:33] It was the right thing to do
[14:46:33] we also had an unrelated increase in load
[14:46:45] which probably contributed to making things worse
[14:46:58] but basically because of the pool/depool cycle
[14:47:01] it is so surprising, 10 seconds to truncate such a small table...
[14:47:09] it creates spikes of connections and disconnections
[14:47:28] it is s3, if something can happen, it will happen :-)
[14:47:32] haha
[14:47:45] 30 errors/second for connections was too high, I think
[14:47:52] I will schedule a delete with a 1 minute sleep tomorrow for the remaining wikis and monitor it
[14:47:59] yeah, it is too high
[14:48:06] I am going to guess delete doesn't even need to wait
[14:48:20] it is the truncate that really does a drop
[14:48:23] I have seen sanitarium/labs not coping well with lots of operations in s3
[14:48:25] and with large buffer pools
[14:48:34] that is known as creating contention
[14:48:46] I have not checked our partitioning model
[14:48:50] maybe it needs tuning
[14:49:11] I think it was not the only thing ongoing
[14:49:20] servers are doing 30K QPS
[14:49:29] so we have high load
[14:49:54] could be linter
[14:50:33] not sure what ApiStashEdit::checkCache is
[14:50:33] yeah, I saw that, 30k..
[14:50:52] the kill "fixed" the lag and the connection errors
[14:51:01] but there is something else going on
[14:51:03] :)
[14:52:53] there seems to be a high number of enwikiversity queries, but maybe that is normal
[14:53:30] I also see some work on testwikidatawiki maintenance connections
[14:53:50] Amir1: anything scheduled for testwikidatawiki?
[14:54:07] I see stuff for it
[14:54:14] • 13:11 ladsgroup@tin: Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/421269 (duration: 01m 15s)
[14:54:17] • 13:09 ladsgroup@tin: Synchronized wmf-config/Wikibase-production.php: https://gerrit.wikimedia.org/r/421269 (duration: 01m 16s)
[14:54:47] well, I see maintenance work, not really webrequest or job requests
[14:58:43] I will keep an eye on it, the high QPS is not yet a huge issue
[14:59:23] I don't see anything too related in last night's SWAT
[15:03:14] the wikidata one is the regular dispatcher, so it is not that
[15:03:31] jynus: yup, we did something today
[15:03:48] on testwikidata?
[15:04:05] (I was on lunch break, sorry to miss it)
[15:04:18] you have every right to take your break
[15:04:23] so please finish eating first!
[15:04:59] we are not running any maintenance script but we are changing the columns it writes/reads from the database
[15:05:13] is it putting too much pressure on the database?
[15:05:15] ok, could it be more queries but around the same io?
[15:05:34] because if it is that, then it is not worrying at all
[15:05:56] the IO should decrease, the number of queries should stay untouched
[15:06:14] I need to dig deeper on this matter
[15:06:22] then it is not that, we are seeing an increased number of QPS on s3
[15:06:32] which is ok if it is planned
[15:07:40] hmm, it can be a small wiki going crazy
[15:07:50] we are not yet even sure it is wikidata, we need to look at more metrics
[15:07:55] 10DBA, 10Operations, 10ops-eqiad: db1062 (s7 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190303#4072904 (10Cmjohnson) Replaced disk and it's rebuilding Enclosure Device ID: 32 Slot Number: 6 Drive's position: DiskGroup: 0, Span: 3, Arm: 0 Enclosure position: 1 De...
[15:08:56] funnily, there was another spike exactly 7 days ago
[15:09:04] so it could be an unrelated cron
[15:09:20] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s3&var-role=All&panelId=1&fullscreen&from=now-30d&to=now
[15:13:11] 10DBA, 10Operations, 10ops-eqiad: db1061 (s6 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190299#4072907 (10Cmjohnson) Replaced the disk at slot 11 cmjohnson@db1061:~$ sudo megacli -PDList -aALL | grep "Firmware state" Firmware state: Online, Spun Up Firmware stat...
[15:14:56] <_joe_> https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?panelId=9&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=master&from=now-7d&to=now seems the per-section concurrency change was effective
[15:17:30] 10DBA, 10Operations, 10ops-eqiad: db1054 (s2 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190302#4072920 (10Cmjohnson) Disk has been replaced and raid is rebuilding cmjohnson@db1054:~$ sudo megacli -PDList -aALL | grep "Firmware state" Firmware state: Online, Spun...
[15:22:21] 10DBA, 10Operations, 10ops-eqiad: db1052 (s1 master) disks with lots of predictive failure errors - https://phabricator.wikimedia.org/T190301#4072949 (10Cmjohnson) Replaced the disk at slot 2....I will wait for the rebuild to complete before swapping slot 8 Firmware state: Online, Spun Up Firmware state: On...
[15:47:02] s3 qps keep increasing
[15:49:30] mm, something happened yesterday at 17:44: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=8&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1077&var-port=9104&from=1521637034143&to=1521681962143
[15:50:03] Amir1: This matches that increase: https://gerrit.wikimedia.org/r/#/c/420336/
[15:50:14] From SAL: • 17:46 maxsem@tin: Synchronized wmf-config/Wikibase.php: https://gerrit.wikimedia.org/r/#/c/420336/ (duration: 01m 15s)
[15:50:35] can that explain that huge increase?
[15:51:00] marostegui: existence of a bug can. Is there a phab card?
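As an aside, a minimal sketch of the kind of per-schema check that helps narrow down which wiki is behind a QPS jump like the one being chased above. This is illustrative rather than what was actually run: it assumes the stock MySQL/MariaDB system tables and that performance_schema is enabled on the host.

```sql
-- Snapshot of current connections grouped by schema: which wiki has the
-- most threads open right now, and how long they have been running.
SELECT db, COUNT(*) AS threads, SUM(time) AS total_time
FROM information_schema.processlist
GROUP BY db
ORDER BY threads DESC
LIMIT 10;

-- Cumulative statement counts per schema since the digest table was last
-- reset, to see which wiki's query volume actually grew.
SELECT schema_name, SUM(count_star) AS statements
FROM performance_schema.events_statements_summary_by_digest
GROUP BY schema_name
ORDER BY statements DESC
LIMIT 10;
```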
[15:51:12] mmm this too: https://gerrit.wikimedia.org/r/#/c/420947/
[15:51:37] actually that last one matches more
[15:52:01] hmm, the other one is more likely, ours should not touch QPS
[15:52:09] and decrease IO
[15:54:58] https://phabricator.wikimedia.org/T189806#4073107
[16:10:42] 10DBA, 10Operations, 10hardware-requests, 10ops-eqiad: Decommission db1009 - https://phabricator.wikimedia.org/T189216#4073157 (10Cmjohnson)
[16:31:09] ok, so the revert worked apparently
[17:22:37] 10DBA: Generate report of disk health for database masters and master candidates - https://phabricator.wikimedia.org/T190035#4073450 (10Marostegui)
[17:22:40] 10DBA, 10Operations, 10ops-eqiad: db1062 (s7 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190303#4073447 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good - thanks Chris! ``` root@db1062:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks:...
[17:30:06] 10DBA: Generate report of disk health for database masters and master candidates - https://phabricator.wikimedia.org/T190035#4073464 (10Marostegui)
[17:30:08] 10DBA, 10Operations, 10ops-eqiad: db1054 (s2 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190302#4073461 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good now, thanks @Cmjohnson ``` root@db1054:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtu...
[17:32:08] 10DBA: Generate report of disk health for database masters and master candidates - https://phabricator.wikimedia.org/T190035#4073470 (10Marostegui)
[17:32:10] 10DBA, 10Operations, 10ops-eqiad: db1061 (s6 master) disk with lots of predictive failure errors - https://phabricator.wikimedia.org/T190299#4073467 (10Marostegui) 05Open>03Resolved a:03Cmjohnson All good now - thanks Chris! ``` root@db1061:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Dis...
[19:37:09] 10DBA, 10Operations, 10ops-eqiad: db1052 (s1 master) disks with lots of predictive failure errors - https://phabricator.wikimedia.org/T190301#4073898 (10Cmjohnson) The disk at slot 8 has been swapped and rebuilding Firmware state: Online, Spun Up Firmware state: Online, Spun Up Firmware state: Online, Spun...
[20:08:24] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1052 - https://phabricator.wikimedia.org/T190446#4073977 (10Volans) p:05Triage>03Normal It's now rebuilding AFAIK there was a disk replaced: ``` $ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli === RaidStatus (does not include components i...
[22:15:32] 10DBA, 10Gerrit, 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Next): Gerrit is failing to connect to db on gerrit2001 thus preventing systemd from working - https://phabricator.wikimedia.org/T176532#4074444 (10Dzahn) - added parameter to base monitoring class to allow disabling of system...