[00:40:27] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: move wikitech and labstestwiki to s3 (needs discussion) - https://phabricator.wikimedia.org/T167973#3405749 (bd808)
[00:40:31] DBA, Data-Services, MediaWiki-extensions-Babel, Security-Team, WMF-Legal: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3405750 (bd808)
[00:40:33] DBA, Data-Services, Epic: Labs database replica drift - https://phabricator.wikimedia.org/T138967#3405751 (bd808)
[00:40:36] DBA, Data-Services: Expose ar_content_format and ar_content_model columns of archive table on Labs replicas - https://phabricator.wikimedia.org/T89741#3405752 (bd808)
[05:08:04] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3405812 (Marostegui) This happened again ``` root@db1046:~# megacli -LDInfo -Lall -aALL | grep Policy Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write...
[05:16:40] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3405816 (Marostegui) And it recovered for now: ``` ˜/icinga-wm 7:15> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy root@d...
[05:17:06] Blocked-on-schema-change, DBA, Patch-For-Review: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3405817 (Marostegui) s7 master (db1062) has been altered
[05:17:17] Blocked-on-schema-change, DBA, Patch-For-Review: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3405818 (Marostegui)
[05:29:30] Blocked-on-schema-change, DBA, Patch-For-Review: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3405821 (Marostegui) s3 master (db1075) has been altered
[05:29:40] Blocked-on-schema-change, DBA, Patch-For-Review: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661#3405822 (Marostegui)
[07:08:11] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3405971 (Marostegui) It alerted again, but this time looks like the BBU is actually doing the learning: ``` BatteryType: BBU Voltage: 3754 mV Current: -674 mA Tempe...
[07:24:10] DBA, Operations, ops-codfw: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3406026 (Marostegui)
[07:24:22] DBA, Operations, ops-codfw: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3406040 (Marostegui) p:Triage>Normal
[07:44:22] jynus: may I ask you for a second opinion on dbstore1001 s6 duplicate entry error?
[07:44:53] I was about to ask you
[07:45:01] ah :)
[07:46:17] So, according to the error, 'Duplicate entry '296908517' for key 'PRIMARY', it does exist on dbstore1001, but I am not 100% sure if it is the right one to delete (remember s6 was the one that crashed and had to have watchlist reimported because it was acting weirdly with an insert replace)
[07:48:17] well, don't ask me- check the relay log of the server, or the binlog of its master
[07:48:43] that is what I did :)
[07:48:47] do you want me to handle this?
[07:49:21] I wanted a second opinion, I will double check and ping you if needed
[07:49:24] no worries
[07:49:33] I can check again if you want
[07:49:45] I cannot know without checking
[07:49:54] of course
[08:12:42] if you don't mind, I would appreciate it if you could check when you have time, I think it is safe to delete it, but I don't want to mess it up and would prefer another pair of eyes
[08:12:49] sorry for bothering
[08:12:59] worst case, I can just reimport it (12G)
[08:38:26] marostegui: jynus: hello! We are going to reopen the closed wiki nlwikinews that has been closed for ages. I am wondering whether its database schema is up-to-date
[08:38:31] and I am not sure how to verify that :D
[08:39:09] hashar: wait a second
[08:39:15] though I think "closed" means that only stewards can edit. So closed wikis are still maintained properly
[08:39:18] no urgency
[08:39:56] hashar: I checked the other day and it is still included on s3.dblist for instance, so it has been getting the latest schema changes
[08:40:02] like the one i have done this week for the image table, for instance
[08:41:12] awesome :-}
[08:41:13] marostegui: jynus: thank you!
[08:52:34] hashar: what do you mean with "reopen"
[08:52:43] closed -> not closed ?
[08:53:01] yes
[08:53:13] https://gerrit.wikimedia.org/r/#/c/361686/2/dblists/closed.dblist
[08:53:47] the patch solely removes "nlwikinews" from closed.dblist which looks suspicious to me :D
[08:53:49] but what are the implications
[08:54:00] we keep the same content?
[08:54:15] the intention with that
[08:54:19] as I understand it "closed" wikis are only editable by stewards (and maybe some other privileged group of users)
[08:54:33] no, I understand
[08:54:42] the database has been kept, but all pages have been deleted. There were some discussions in that sense on the task
[08:54:56] yeah, but users are still there
[08:55:21] since deleting a wiki is a lot of work
[08:55:34] so yeah users are still around, and apparently the wiki is still maintained properly
[08:55:42] I am probably just overthinking
[08:55:44] yeah
[08:55:52] for us closed is like other wiki
[08:56:02] private and deleted are the problematic ones
[08:56:21] I just want to be sure you want to do the right thing
[09:08:52] DBA, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3406297 (Marostegui) `˜/icinga-wm 11:05> RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy`
[09:21:33] somehow I got disconnected from this channel ..
[09:43:13] I am compiling 10.1.25- going for a coffee
[09:43:26] enjoy
[09:44:29] we should talk later about deploying dbstore_multiinstance
[09:44:55] yep
[10:00:32] DBA, MediaWiki-extensions-WikibaseClient, Wikidata, Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3406454 (Lydia_Pintscher)
[11:36:17] I created https://phabricator.wikimedia.org/T169658 on mark's request
[11:36:45] perfect, thanks
[11:37:03] that was for marostegui, sorry I pinged you
[11:37:59] I checked the right way to do that according to phab admins
[11:38:25] and I didn't see a way- the old way (tags) is deprecated, but I cannot do the new way (milestones)
[11:46:54] I've started working on a mariadb@ service unit : https://gerrit.wikimedia.org/r/#/c/363327/1/dbtools/mariadb%2540.service
[11:55:16] Thanks for creating the main ticket
[12:50:13] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3407193 (Marostegui) ``` ˜/icinga-wm 14:48> PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough BBU status for...
[12:59:42] DBA, Operations, ops-eqiad: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3407221 (Marostegui) ``` ˜/icinga-wm 14:58> RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy ```
[13:07:05] Blocked-on-schema-change, DBA: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204#3407228 (Marostegui) codfw finished
[13:36:52] DBA, Wikidata, Patch-For-Review, Performance, and 2 others: slow master queries on Wikibase\Client\Usage\Sql\EntityUsageTable::getAffectedRowIds - https://phabricator.wikimedia.org/T169336#3407295 (jcrespo) This is still ongoing with 15-minute queries.
I am going to setup a task to kill all relat...
[13:39:28] jynus marostegui I mentioned you in https://gerrit.wikimedia.org/r/#/c/362880, people are asking if it's okay to do select on master for ACID reasons but in smaller batch and I said this is something DBAs should decide
[13:39:56] The bug we are talking about is https://phabricator.wikimedia.org/T169336
[13:42:08] I don't care what the queries are
[13:42:20] I just need the queries to be fast
[13:42:49] 15m query on a single point of failure is not fast
[13:43:49] 24m query now
[13:44:09] yeah, there are two ways to handle it now: 1- move it to slave 2- make it happen in smaller batches
[13:44:17] which one do you recommend
[13:44:19] DBA, Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3407320 (Urbanecm) Open>Resolved Wiki is reopened and it can be edited by anyone as of now. As there is nothing we can do now at our side, I'...
[13:44:29] I have no preference
[13:44:49] I just commented on the gerrit change
[13:44:50] I suggested that exactly on the bug report
[13:44:58] I rather have the queries sent to slaves to be honest
[13:45:23] the ongoing issue is the long running query- which I am going to setup to kill now
[13:47:54] pt-kill F=/dev/null --socket=/tmp/mysql.sock --print --kill --victims=all --match-info="EntityUsageTable" --match-db=wikidatawiki --match-user=wikiuser --busy-time=1
[13:48:06] I am going to run that on a screen session, marostegui
[13:48:10] on db1063
[13:48:20] let me know if you see something wrong with that
[13:48:40] jynus: it looks good to me
[13:48:44] almost 30 minutes now, is madness
[13:48:47] maybe in a while loop in case it
[13:48:55] fails, as it happens sometimes
[13:49:28] marostegui: the main issue is that with a couple of those queries
[13:49:37] it can bring down the master
[13:49:46] lovely
[13:49:54] set the while loop then :-)
[13:49:59] similar to what happened on x1
[13:50:10] during the switchover no?
[13:50:11] and those were just a few seconds-long
[13:50:14] yeah
[13:50:24] so we have to be super-strict on the master
[13:50:46] we do not care much, for example, if it had to run on the vslow slaves
[13:51:42] the query is now gone, did you run it now?
[13:51:52] I did, but I didn't kill it
[13:52:02] it may have finished after 30 minutes or something
[13:52:20] it printed it on my tests, however
[13:52:21] yeah, coincidence
[13:53:13] it is a screen for now on db1063, as I would expect this to be a temporary measure
[13:53:35] but given the history, we may want to puppetize pt-kill for generic usage
[13:54:31] yeah, not a bad idea
[13:54:34] how watchdog would work, but we do not run it on the master because it is dangerous to kill writes
[13:54:57] because revertion and all that
[13:55:08] *rollback, I meant
[13:55:20] yeah, that is the fear when setting up stuff that kills things on master, i am always scared of doing so in an automated way
[13:55:35] it can be a big mess up
[13:55:45] that is why it is important to be very strict on master queries
[13:56:08] slaves have watchdogs already
[13:56:24] yeah, and in this case we care "less" if we have a 30 minutes query on a slave
[13:56:28] but on the master, that is mad
[13:59:09] DBA, Wikidata, Patch-For-Review, Performance, and 2 others: slow master queries on Wikibase\Client\Usage\Sql\EntityUsageTable::getAffectedRowIds - https://phabricator.wikimedia.org/T169336#3407421 (jcrespo) I've setup a temporary watchdog on the s5 master: ``` pt-kill F=/dev/null --socket=/tmp/m...
[14:11:13] DBA, Analytics: Drop MoodBar tables from all wikis - https://phabricator.wikimedia.org/T153033#3407478 (MarcoAurelio) Per above.
[14:12:15] DBA, Cloud-Services, Patch-For-Review: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743#3407481 (Marostegui) I thought I would update with the latest news from this task: db1102 is the new sanitarium 3 running multi-instance with...
[15:21:47] can I restart db2072 for new package testing?
[15:22:13] I see it downtime'd, maybe you are doing something on it?
[16:15:13] I will now restart db2062
[16:44:17] DBA, Operations, ops-codfw: db2044: Disk on predictive failure - https://phabricator.wikimedia.org/T169693#3408155 (Papaul) Dear Mr Papaul Tshibamba, Thank you for contacting Hewlett Packard Enterprise for your service request. This email confirms your request for service and the details are below....
[16:45:23] DBA, Operations, Wikimedia-Site-requests, User-MarcoAurelio: Global rename of Idh0854 → Garam: supervision needed - https://phabricator.wikimedia.org/T167031#3408161 (MarcoAurelio)
[16:51:21] DBA: Truncate table "l10n_cache" on all wmf sites - https://phabricator.wikimedia.org/T169375#3408186 (demon)
[16:51:23] DBA: truncate l10n_cache table on WMF wikis - https://phabricator.wikimedia.org/T150306#3408189 (demon)
[17:04:18] DBA, MediaWiki-Parser, MediaWiki-Platform-Team, Patch-For-Review: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3408252 (jcrespo) Now I can see the disk usage getting stable: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=17&...
[17:08:15] DBA, Patch-For-Review: Refactor puppet mariadb class to support multi-instance hosts - https://phabricator.wikimedia.org/T169514#3408265 (jcrespo) I have uploaded mariadb 10.1.25, and at the same time implemented the missing multi-instance support on the package (mariadb@ systemd unit). Tomorrow we shoul...
[17:22:50] jynus: Hi! just following up on T168584.
Can we setup a time to do labsdb1005 first (re: my comment https://phabricator.wikimedia.org/T168584#3395348)
[17:22:51] T168584: Labsdb* servers need to be rebooted - https://phabricator.wikimedia.org/T168584
[18:12:52] DBA, Wikidata, Performance, User-Ladsgroup, Wikidata-Sprint: slow master queries on Wikibase\Client\Usage\Sql\EntityUsageTable::getAffectedRowIds - https://phabricator.wikimedia.org/T169336#3395074 (Ladsgroup) Open>Resolved
[18:30:04] jynus: marostegui puppet disable on labsdb1009 and 10, is that still active?
[18:44:45] DBA, Data-Services: Expose ar_content_format and ar_content_model columns of archive table on Labs replicas - https://phabricator.wikimedia.org/T89741#1044220 (chasemp) @Umherirrender could you put up a patch to https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/tem...
[19:04:42] DBA, Cloud-Services: Drop ukwikimedia from labsdb hosts (was: ukwikimedia still present on replicas dbs on labs hosts) - https://phabricator.wikimedia.org/T169488#3399766 (chasemp) >>! In T169488#3403344, @Marostegui wrote: > The reason I suggested to use the script because it was pointed out to me that...
[19:08:22] DBA, Cloud-Services: Drop ukwikimedia from labsdb hosts (was: ukwikimedia still present on replicas dbs on labs hosts) - https://phabricator.wikimedia.org/T169488#3408797 (Marostegui) It is not super common, but if a wiki is moved to deleted.dblist, then our check_private_data script will complain and we...
[19:13:43] chasemp: yep, it is, sorry
[19:13:55] marostegui: no worries, just checking :)
[19:14:26] Is it blocking you?
[19:17:01] marostegui: not meaningfully, i can do a maintain-views run whenever man. no big deal.
[19:17:53] chasemp: You sure? I can enable it, you run it, and then i can disable it :)
[19:18:07] I'll come back to it friday
[19:18:12] Ok :)
[19:18:14] Thanks!
[19:32:01] DBA, Data-Services: Drop ukwikimedia from labsdb hosts (was: ukwikimedia still present on replicas dbs on labs hosts) - https://phabricator.wikimedia.org/T169488#3408913 (bd808) Open>Resolved a:jcrespo I ran `sudo /usr/local/sbin/maintain-meta_p --all-databases --debug` on: * labsdb1001 * lab...
[20:22:51] DBA, Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409167 (Dcljr) >>! In T168764#3407320, @Urbanecm wrote: > Wiki is reopened and it can be edited by anyone as of now. Technically, yes, but the conv...
[20:37:29] DBA, Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409233 (MF-Warburg) But that is irrelevant for the bug: >>! In T168764#3407320, @Urbanecm wrote: > As there is nothing we can do now at our side
[20:43:55] DBA, Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3376111 (Koavf) For those watching this, the site is live.
[20:44:54] DBA, Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409284 (Dcljr) I know, but just in case some users interested in editing the wiki are following this task, they should know that editing should wait...
[20:44:59] DBA, Data-Services, MediaWiki-extensions-Babel, Security-Team, WMF-Legal: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3409285 (APalmer_WMF) Approved by Legal. Thanks, all.
[20:50:12] DBA, Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409307 (Koavf) Yes, that is what I was trying to say--import seems to be done and there aren't redlinks everywhere anymore.
[20:51:57] DBA, Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, and 2 others: Reopen Wikinews Dutch - https://phabricator.wikimedia.org/T168764#3409316 (Dcljr) Sigh… My previous comment was in response to MF-Warburg.
[22:11:38] DBA, Data-Services, MediaWiki-extensions-Babel: Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3409602 (bd808) p:Triage>Normal Approved by Legal and Security. The table needs to be added to maintain-views.yaml and then run in all the appropriate places.
[22:16:20] DBA, Data-Services, MediaWiki-extensions-Babel, Patch-For-Review, cloud-services-team (Kanban): Replicate babel db table on Labs - https://phabricator.wikimedia.org/T160713#3409619 (bd808)
[22:43:59] DBA, Analytics, Analytics-EventLogging, Contributors-Analysis, and 5 others: Drop tables with bad data: mediawiki_page_create_1 mediawiki_revision_create_1 - https://phabricator.wikimedia.org/T169781#3409695 (kaldari)