[03:45:51] DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#4004830 (Andrew)
[04:53:35] DBA, Wikimedia-Site-requests: Global rename of Darkweasel94 → Tokfo: supervision needed - https://phabricator.wikimedia.org/T187629#4004915 (alanajjar) If any one around please process this request
[06:41:26] <_joe_> wmopbot? that's new
[06:44:18] I didn't notice that
[06:52:39] es1017 is 90% used (still 1.2T available), and there are 4.T in /tmp because of dbstore1001 data so we need to decide if we want to keep it or not (probably not worth it)
[06:53:48] DBA, Operations, ops-codfw: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#4005049 (Marostegui) Open>Resolved Thanks! ``` root@db2048:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337E3350) Port Name: 1I...
[07:30:35] _joe_: marostegui: I added it a few weeks ago. It alerts the IRC ops in #wikimedia-ops whenever things happen (kline, kick, etc.) and keeps track of banned users
[07:31:01] legoktm: Aaah right! :-)
[07:31:02] <_joe_> legoktm: yeah I went to look it up :P
[07:45:24] DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#3977928 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1111.eqiad.wmnet'] ```...
[07:56:10] DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4005143 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'...
[08:03:08] DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4005162 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1111.eqiad.wmnet'] ```...
[09:22:57] https://gerrit.wikimedia.org/r/#/c/414967/
[09:37:22] I told you! :-)
[09:37:34] :)
[09:38:07] I would delete dbstore1001 data
[09:38:19] agreed - I will do it
[09:50:57] DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4005279 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ``` and were **ALL** successful.
[09:52:18] question- if the recovery is a snapshot of production
[09:52:30] wouldn't it just be easier to copy from production?
[09:52:36] DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4005283 (Marostegui) Open>Resolved a:Marostegui The server was reimaged and all the data transferred back from db1112 and it is now fu...
[09:52:41] jynus: you talking about db1111?
[09:52:50] yep
[09:53:07] ah, you fixed it already
[09:53:10] Nah, they have special grants and they have commons + eowiki (and they might have different schemas for some tables)
[09:53:10] ignore me
[09:53:19] So it was "safer" to clone it from their own slave
[09:53:31] Because they might have changed tables and all that
[09:53:39] db1111 doesn't replicate from production
[09:53:52] that should be puppetized more heavily
[09:54:23] anyway, back to work
[09:54:41] yeah, at least the grants (they only have two users, so at least a tracking .sql with that)
[09:54:44] :)
[09:54:46] one thing
[09:54:58] shoot
[09:55:09] if you have 1 minute, can you add an extra grant for labswiki on m5 for dump
[09:55:20] assuming you have not done m5 yet
[09:55:37] I ran the scripts and I was checking now for errors
[09:55:48] I will do it, no worries
[09:56:26] m2 takes a while because otrs attachments on db
[09:57:05] yeah, it failed, so I will run it manually :)
[09:57:09] Going to add the grants now
[09:58:38] did you find what was going on with db1111?
[09:58:44] controller being weird?
[09:58:57] yeah, not the first time we saw that on a dell :(
[09:59:10] remember we did some tests after it happened once…and it worked fine all the time? :(
[09:59:31] but I seem to remember that was on another brand
[09:59:37] I think it was a dell
[09:59:39] without the cache enabled
[10:00:23] https://phabricator.wikimedia.org/T174054
[10:00:53] that is an HP
[10:01:14] I am pretty sure we had that on a dell too
[10:01:20] I will check later, going back to work :p
[10:01:29] many things going on
[10:01:35] if you mean
[10:01:42] "controller failed" in general
[10:01:50] all of them failed at some point
[10:01:56] no
[10:02:04] I was pointing specifically to "controller failed while bad disk"
[10:02:05] the same behaviour of pulling a disk making it crash
[10:02:08] yes
[10:06:59] tendril is back up
[10:07:04] and replication is flowing fine on db2093
[10:07:23] we will see if it is able to stay in sync with the master (so far it cannot)
[10:08:03] let's give it some time
[10:08:47] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005324 (Marostegui) The copy to codfw has been done. db2093 is now replicating from db1115. So far it is not able to keep up with the master, but let's give...
[10:12:05] Idea- and I will disconnect
[10:12:33] if it keeps like that for hours, set a filter on global_status and global_status_5m
[10:12:45] yeah, I was thinking about that
[10:12:48] we'll see
[10:23:08] marostegui: is it possible to perform T187629?
[10:23:09] T187629: Global rename of Darkweasel94 → Tokfo: supervision needed - https://phabricator.wikimedia.org/T187629
[10:23:15] +100k edits
[10:26:05] sure
[10:26:59] okay so if you give me the green light I can start the process
[10:27:08] sure go ahead
[10:27:56] aye sir, let me fetch the link
[10:31:06] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005398 (Marostegui) Replication was working smoothly till I added db2093 to tendril (it wasn't there). Then it broke on db2093 ``` Last_SQL_Er...
[10:31:49] DBA, Wikimedia-Site-requests: Global rename of Darkweasel94 → Tokfo: supervision needed - https://phabricator.wikimedia.org/T187629#4005399 (MarcoAurelio) https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Tokfo
[10:34:51] marostegui: are you trying to do a dump of tendril?
[10:35:02] yes
[10:35:08] nope
[10:35:10] please kill it
[10:35:16] ok
[10:35:17] done
[10:35:19] https://tendril.wikimedia.org/host
[10:35:26] well, it will work now
[10:35:34] remember when I said I tried?
[10:35:51] I really tried, not like "I tried for 5 minutes and couldn't do it :-)"
[10:35:57] hehe
[10:36:19] marostegui: that rename is in progress fwiw
[10:36:23] metadata is impossible for a combination of things: federated, tokudb and all
[10:36:23] ok
[10:36:48] which means mysqldump but only of the real, non-large tables
[10:37:00] right.. :(
[10:37:08] I may have a list around, ask me later
[10:37:19] sure, thanks
[10:40:21] I have killed the blocked queries
[10:40:34] if tendril doesn't start working again, restart the whole server
[10:40:39] after some minutes
[10:40:45] marostegui: the rename is gonna hit commons (+100k edits) shortly
[10:40:48] ok :(
[10:40:53] Hauskatze: ok
[10:40:55] ah, it came back finally
[10:40:59] see you!
[10:41:18] o/
[10:41:58] cewiki and now commons
[10:42:07] now
[10:42:45] "in progress"
[10:43:32] going fine so far
[10:43:51] ok, it's a ton of edits
[10:44:04] says "done" now
[10:44:17] I guess some background jobs are still being performed though
[10:44:39] I am seeing some delay now
[10:44:53] on the old servers
[10:45:03] it is gone now
[10:45:07] codfw lagging behind
[10:46:05] and esams?
[10:46:16] we have no dbs there ;)
[10:46:28] caching only
[10:46:37] ok so eqiad and ulsfo are ok
[10:47:00] give time to the ol' Texas servers
[10:57:42] <_joe_> Hauskatze: ulsfo has no dbs
[10:57:54] <_joe_> only eqiad and codfw have the application layer
[10:57:58] oh
[10:58:04] thanks, good to know :)
[10:58:04] <_joe_> all other datacenters are just caching pops
[10:58:15] <_joe_> also include eqsin now
[10:58:16] <_joe_> :P
[10:58:23] that's why I never saw a script being run on ulsfo or esams
[10:58:25] <_joe_> although it's not serving traffic atm
[10:58:29] <_joe_> yes
[10:58:38] <_joe_> there is no mediawiki there
[10:58:46] eqiad has terbium, and codfw has wasat
[10:59:22] and eqsin is going to be caching only too?
[11:01:17] https://wikitech.wikimedia.org/wiki/Eqsin_cluster says caching
[11:01:20] :)
[11:04:51] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005479 (Marostegui) We might be hitting: https://bugs.mysql.com/bug.php?id=70975 and https://mariadb.com/kb/en/library/mariadb-community-couldnt-execute-show...
[11:05:15] 113 wikis to finish the rename
[11:05:25] MariaDB [centralauth_p]> select count(*) from renameuser_status where ru_status="queued";
[11:24:57] 64 to go
[11:25:28] :)
[11:30:55] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005516 (jcrespo) "Note: restarting mysql fixed the problem, but that is rather intrusive and leaves the server cold for a while." ?
[11:31:52] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005533 (Marostegui) That was tested of course, after every single action I listed before, but unfortunately that didn't work for us :-)
[11:43:06] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005549 (jcrespo) Can you test on your lab if replication of events on that version fails from event_scheduler=1 to event_scheduler=0? I cannot even do SHOW E...
[11:44:10] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005551 (Marostegui) >>! In T184704#4005549, @jcrespo wrote: > Can you test on your lab if replication of events on that version fails from event_scheduler=1...
[11:46:04] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005554 (jcrespo) https://dev.mysql.com/doc/refman/5.7/en/replication-features-invoked.html According to this, an alternative would be to enable global events...
[11:49:05] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005558 (Marostegui) >>! In T184704#4005554, @jcrespo wrote: > https://dev.mysql.com/doc/refman/5.7/en/replication-features-invoked.html According to this, an...
[12:14:18] DBA, Patch-For-Review: Finish the database backups generation script to create consistent logical backups in CODFW - https://phabricator.wikimedia.org/T184696#4005632 (jcrespo) {P6751}
[12:14:23] marostegui: the rename's finished
[12:14:31] how's it on your side?
[12:14:41] any lag, breakage, etc?
[12:14:55] nope
[12:14:56] nothing
[12:15:03] ok so I'll close as resolved
[12:15:09] cool thanks
[12:17:11] DBA, Wikimedia-Site-requests, User-MarcoAurelio: Global rename of Darkweasel94 → Tokfo: supervision needed - https://phabricator.wikimedia.org/T187629#4005634 (MarcoAurelio) Open>Resolved a:MarcoAurelio Rename finished w/o incidents so far.
[12:32:09] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005665 (Marostegui) Some more food for thought of my tests (10.1.31 on jessie) - I have manually corrupted the event table and restarted mysql and attempted...
[13:38:21] DBA, Patch-For-Review: Finish the database backups generation script to create consistent logical backups in CODFW - https://phabricator.wikimedia.org/T184696#4005872 (jcrespo) Looking good so far: ``` python3 dump_instance.py s1 ['/usr/bin/mydumper', '--compress', '--events', '--triggers', '--routine...
[13:38:24] DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005873 (Marostegui) I have opened a bug report with MariaDB: https://jira.mariadb.org/browse/MDEV-15426
[14:09:01] DBA, MediaWiki-Watchlist, Wikidata, Performance, and 3 others: re-enable Wikidata Recent Changes integration on Russian Wikipedia - https://phabricator.wikimedia.org/T179012#4005953 (WMDE-leszek) stalled>Open
[14:09:18] DBA, Commons, Contributors-Team, MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#4005955 (WMDE-leszek)
[14:09:46] DBA, Commons, MediaWiki-Watchlist, Wikidata, and 4 others: Re-enable Wikidata Recent Changes integration on Commons - https://phabricator.wikimedia.org/T179010#4005960 (WMDE-leszek) Open>stalled Now to be done after T179012
[14:10:02] DBA, Commons, Contributors-Team, MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3667201 (WMDE-leszek)
[15:27:21] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4006356 (Marostegui)
[15:47:19] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4006405 (Marostegui)
[15:53:35] DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4006415 (Marostegui)
[18:35:50] DBA, GeoData: Removal of {{#coordinates:}} leaves DB entries behind - https://phabricator.wikimedia.org/T143366#4007182 (Pnorman) Pulling off of the maps sprint, because it's not kartotherian, kartographer, or anything we're responsible for. I'm not sure who is responsible for GeoHack.
[19:38:32] DBA, GeoData: Removal of {{#coordinates:}} leaves DB entries behind - https://phabricator.wikimedia.org/T143366#2566073 (jcrespo) If there is no one responsible for the #GeoData extension, we should start the sunsetting process.
[19:38:43] DBA, Wikidata, Patch-For-Review: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#1709272 (Ladsgroup) After this patch going live and waiting for a week, we should stop writing to those columns and start dropping dead code and eventually...
[22:16:15] DBA, Community-Tech, MediaWiki-extensions-GlobalPreferences, Patch-For-Review, Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#4008324 (Niharika)