[03:45:51] <wikibugs>	 DBA, Data-Services, cloud-services-team, wikitech.wikimedia.org: Move wikitech and labstestwiki to s5 - https://phabricator.wikimedia.org/T167973#4004830 (Andrew)
[04:53:35] <wikibugs>	 DBA, Wikimedia-Site-requests: Global rename of Darkweasel94 → Tokfo: supervision needed - https://phabricator.wikimedia.org/T187629#4004915 (alanajjar) If anyone is around, please process this request
[06:41:26] <_joe_>	 wmopbot? that's new
[06:44:18] <marostegui>	 I didn't notice that
[06:52:39] <marostegui>	 es1017 is 90% used (still 1.2T available), and there are 4.T in /tmp because of dbstore1001 data so we need to decide if we want to keep it or not (probably not worth it)
[06:53:48] <wikibugs>	 DBA, Operations, ops-codfw: db2048: RAID with predictive failure - https://phabricator.wikimedia.org/T187983#4005049 (Marostegui) Open>Resolved Thanks! ``` root@db2048:~# hpssacli controller all show config  Smart Array P420i in Slot 0 (Embedded)    (sn: 0014380337E3350)      Port Name: 1I...
[07:30:35] <legoktm>	 _joe_: marostegui: I added it a few weeks ago. It alerts the IRC ops in #wikimedia-ops whenever things happen (kline, kick, etc.) and keeps track of banned users
[07:31:01] <marostegui>	 legoktm: Aaah right! :-)
[07:31:02] <_joe_>	 legoktm: yeah I went to look it up :P
[07:45:24] <wikibugs>	 DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#3977928 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1111.eqiad.wmnet'] ```...
[07:56:10] <wikibugs>	 DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4005143 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ```  Of which those **FAILED**: ``` ['db1111.eqiad.wmnet'...
[08:03:08] <wikibugs>	 DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4005162 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1111.eqiad.wmnet'] ```...
[09:22:57] <marostegui>	 https://gerrit.wikimedia.org/r/#/c/414967/
[09:37:22] <jynus>	 I told you! :-)
[09:37:34] <marostegui>	 :)
[09:38:07] <jynus>	 I would delete dbstore1001 data
[09:38:19] <marostegui>	 agreed - I will do it
[09:50:57] <wikibugs>	 DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4005279 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1111.eqiad.wmnet'] ```  and were **ALL** successful.
[09:52:18] <jynus>	 question- if the recovery is a snapshot of production
[09:52:30] <jynus>	 wouldn't it just be easier to copy from production?
[09:52:36] <wikibugs>	 DBA, Operations, ops-eqiad, Patch-For-Review: Disk #5 (count starts at #0) of db1111 has corrupted sectors - https://phabricator.wikimedia.org/T187526#4005283 (Marostegui) Open>Resolved a:Marostegui The server was reimaged and all the data transferred back from db1112 and it is now fu...
[09:52:41] <marostegui>	 jynus: you talking about db1111?
[09:52:50] <jynus>	 yep
[09:53:07] <jynus>	 ah, you fixed it already
[09:53:10] <marostegui>	 Nah, they have special grants and they have commons + eowiki (and they might have different schemas for some tables)
[09:53:10] <jynus>	 ignore me
[09:53:19] <marostegui>	 So it was "safer" to clone it from their own slave
[09:53:31] <marostegui>	 Because they might have changed tables and all that
[09:53:39] <marostegui>	 db1111 doesn't replicate from production
[09:53:52] <jynus>	 that should be puppetized more heavily
[09:54:23] <jynus>	 anyway, back to work
[09:54:41] <marostegui>	 yeah, at least the grants (they only have two users, so at least a tracking .sql with that)
[09:54:44] <marostegui>	 :)
[09:54:46] <jynus>	 one thing
[09:54:58] <marostegui>	 shoot
[09:55:09] <jynus>	 if you have 1 minute, can you add an extra grant for labswiki on m5 for dump
[09:55:20] <jynus>	 assuming you have not done m5 yet
[09:55:37] <marostegui>	 I ran the scripts and I was checking now for errors
[09:55:48] <marostegui>	 I will do it, no worries
[09:56:26] <jynus>	 m2 takes a while because of the otrs attachments on the db
[09:57:05] <marostegui>	 yeah, it failed, so I will run it manually :)
[09:57:09] <marostegui>	 Going to add the grants now
[09:58:38] <jynus>	 did you find what was going on with db1111 ?
[09:58:44] <jynus>	 controller being weird?
[09:58:57] <marostegui>	 yeah, not the first time we saw that on a dell :(
[09:59:10] <marostegui>	 remember we did some tests after it happened once…and it worked fine all the time? :(
[09:59:31] <jynus>	 but I seem to remember that was on another brand
[09:59:37] <marostegui>	 I think it was a dell
[09:59:39] <jynus>	 without the cache enabled
[10:00:23] <jynus>	 https://phabricator.wikimedia.org/T174054
[10:00:53] <marostegui>	 that is an HP
[10:01:14] <marostegui>	 I am pretty sure we had that on a dell too
[10:01:20] <marostegui>	 I will check later, going back to work :p
[10:01:29] <marostegui>	 many things going on
[10:01:35] <jynus>	 if you mean
[10:01:42] <jynus>	 "controller failed" in general
[10:01:50] <jynus>	 all of them failed at some point
[10:01:56] <marostegui>	 no
[10:02:04] <jynus>	 I was pointing specifically to "controller failed while bad disk"
[10:02:05] <marostegui>	 the same behaviour of pulling a disk making it crash
[10:02:08] <jynus>	 yes
[10:06:59] <marostegui>	 tendril is back up
[10:07:04] <marostegui>	 and replication is flowing fine on db2093
[10:07:23] <marostegui>	 we will see if it is able to stay in sync with the master (so far it cannot)
[10:08:03] <marostegui>	 let's give it some time
[10:08:47] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005324 (Marostegui) The copy to codfw has been done. db2093 is now replicating from db1115. So far it is not able to keep up with the master, but let's give...
[10:12:05] <jynus>	 Idea- and I will disconnect
[10:12:33] <jynus>	 if it keeps like that for hours, set a filter on global_status and global_status_5m
[10:12:45] <marostegui>	 yeah, I was thinking about that
[10:12:48] <marostegui>	 we'll see
[10:23:08] <Hauskatze>	 marostegui: is it possible to perform T187629?
[10:23:09] <stashbot>	 T187629: Global rename of Darkweasel94 → Tokfo: supervision needed - https://phabricator.wikimedia.org/T187629
[10:23:15] <Hauskatze>	 +100k edits
[10:26:05] <marostegui>	 sure
[10:26:59] <Hauskatze>	 okay so if you give me the green light I can start the process
[10:27:08] <marostegui>	 sure go ahead
[10:27:56] <Hauskatze>	 aye sir, let me fetch the link
[10:31:06] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005398 (Marostegui) Replication was working smoothly till I added db2093 to tendril (it wasn't there). Then it broke on db2093 ```                Last_SQL_Er...
[10:31:49] <wikibugs>	 DBA, Wikimedia-Site-requests: Global rename of Darkweasel94 → Tokfo: supervision needed - https://phabricator.wikimedia.org/T187629#4005399 (MarcoAurelio) https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Tokfo
[10:34:51] <jynus>	 marostegui: are you trying to do a dump of tendril?
[10:35:02] <marostegui>	 yes
[10:35:08] <jynus>	 nope
[10:35:10] <jynus>	 please kill it
[10:35:16] <marostegui>	 ok
[10:35:17] <marostegui>	 done
[10:35:19] <jynus>	 https://tendril.wikimedia.org/host
[10:35:26] <jynus>	 well, it will work now
[10:35:34] <jynus>	 remember when I said I tried?
[10:35:51] <jynus>	 I really tried, not like "I tried for 5 minutes and could do it :-)"
[10:35:57] <marostegui>	 hehe
[10:36:19] <Hauskatze>	 marostegui: that rename is in progress fwiw
[10:36:23] <jynus>	 metadata is impossible for a combination of things: federated, tokudb and all
[10:36:23] <marostegui>	 ok
[10:36:48] <jynus>	 which means mysqldump but only of the real, non-large tables
[10:37:00] <marostegui>	 right.. :(
[10:37:08] <jynus>	 I may have a list around, ask me later
[10:37:19] <marostegui>	 sure, thanks
[10:40:21] <jynus>	 I have killed the blocked queries
[10:40:34] <jynus>	 if tendril doesn't start working again, restart the whole server
[10:40:39] <jynus>	 after some minutes
[10:40:45] <Hauskatze>	 marostegui: the rename is gonna hit commons (+100k edits) shortly
[10:40:48] <marostegui>	 ok :(
[10:40:53] <marostegui>	 Hauskatze: ok
[10:40:55] <jynus>	 ah, it came back finally
[10:40:59] <jynus>	 see you!
[10:41:18] <marostegui>	 o/
[10:41:58] <Hauskatze>	 cewiki and now commons
[10:42:07] <Hauskatze>	 now
[10:42:45] <Hauskatze>	 "in progress"
[10:43:32] <marostegui>	 going fine so far
[10:43:51] <Hauskatze>	 ok, it's a ton of edits
[10:44:04] <Hauskatze>	 says "done" now
[10:44:17] <Hauskatze>	 I guess some background jobs are still being performed though
[10:44:39] <marostegui>	 I am seeing some delay now
[10:44:53] <marostegui>	 on the old servers
[10:45:03] <marostegui>	 it is gone now
[10:45:07] <marostegui>	 codfw lagging behind
[10:46:05] <Hauskatze>	 and esams?
[10:46:16] <marostegui>	 we have no dbs there ;)
[10:46:28] <Hauskatze>	 caching only
[10:46:37] <Hauskatze>	 ok so eqiad and ulsfo are ok
[10:47:00] <Hauskatze>	 give time to the ol' Texas servers
[10:57:42] <_joe_>	 Hauskatze: ulsfo has no dbs
[10:57:54] <_joe_>	 only eqiad and codfw have the application layer
[10:57:58] <Hauskatze>	 oh
[10:58:04] <Hauskatze>	 thanks, good to know :)
[10:58:04] <_joe_>	 all other datacenters are just caching pops
[10:58:15] <_joe_>	 also include eqsin now 
[10:58:16] <_joe_>	 :P
[10:58:23] <Hauskatze>	 that's why I never saw a script being run on ulsfo or esams
[10:58:25] <_joe_>	 although it's not serving traffic atm
[10:58:29] <_joe_>	 yes
[10:58:38] <_joe_>	 there is no mediawiki there
[10:58:46] <Hauskatze>	 eqiad has terbium, and codfw has wasat
[10:59:22] <Hauskatze>	 and eqsin is going to be caching only too?
[11:01:17] <Hauskatze>	 https://wikitech.wikimedia.org/wiki/Eqsin_cluster says caching
[11:01:20] <Hauskatze>	 :)
[11:04:51] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005479 (Marostegui) We might be hitting: https://bugs.mysql.com/bug.php?id=70975 and https://mariadb.com/kb/en/library/mariadb-community-couldnt-execute-show...
[11:05:15] <Hauskatze>	 113 wikis to finish the rename
[11:05:25] <Hauskatze>	 MariaDB [centralauth_p]> select count(*) from renameuser_status where ru_status="queued";
[11:24:57] <Hauskatze>	 64 to go
[11:25:28] <marostegui>	 :)
[11:30:55] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005516 (jcrespo) "Note: restarting mysql fixed the problem, but that is rather intrusive and leaves the server cold for a while." ?
[11:31:52] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005533 (Marostegui) That was tested of course, after every single action I listed before, but unfortunately that didn't work for us :-)
[11:43:06] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005549 (jcrespo) Can you test on your lab if replication of events on that version fails from event_scheduler=1 to event_scheduler=0? I cannot even do SHOW E...
[11:44:10] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005551 (Marostegui) >>! In T184704#4005549, @jcrespo wrote: > Can you test on your lab if replication of events on that version fails from event_scheduler=1...
[11:46:04] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005554 (jcrespo) https://dev.mysql.com/doc/refman/5.7/en/replication-features-invoked.html According to this, an alternative would be to enable global events...
[11:49:05] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005558 (Marostegui) >>! In T184704#4005554, @jcrespo wrote: > https://dev.mysql.com/doc/refman/5.7/en/replication-features-invoked.html According to this, an...
[12:14:18] <wikibugs>	 DBA, Patch-For-Review: Finish the database backups generation script to create consistent logical backups in CODFW - https://phabricator.wikimedia.org/T184696#4005632 (jcrespo) {P6751}
[12:14:23] <Hauskatze>	 marostegui: the rename's finished
[12:14:31] <Hauskatze>	 how is it on your side?
[12:14:41] <Hauskatze>	 any lag, breakage, etc?
[12:14:55] <marostegui>	 nope
[12:14:56] <marostegui>	 nothing
[12:15:03] <Hauskatze>	 ok so I'll close as resolved
[12:15:09] <marostegui>	 cool thanks
[12:17:11] <wikibugs>	 DBA, Wikimedia-Site-requests, User-MarcoAurelio: Global rename of Darkweasel94 → Tokfo: supervision needed - https://phabricator.wikimedia.org/T187629#4005634 (MarcoAurelio) Open>Resolved a:MarcoAurelio Rename finished w/o incidents so far.
[12:32:09] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005665 (Marostegui) Some more food for thought of my tests (10.1.31 on jessie) - I have manually corrupted the event table and restarted mysql and attempted...
[13:38:21] <wikibugs>	 DBA, Patch-For-Review: Finish the database backups generation script to create consistent logical backups in CODFW - https://phabricator.wikimedia.org/T184696#4005872 (jcrespo) Looking good so far:   ``` python3 dump_instance.py  s1 ['/usr/bin/mydumper', '--compress', '--events', '--triggers', '--routine...
[13:38:24] <wikibugs>	 DBA, Patch-For-Review: Setup tendril database monitoring on 2 new hosts, one on eqiad and one on codfw - https://phabricator.wikimedia.org/T184704#4005873 (Marostegui) I have opened a bug report with MariaDB: https://jira.mariadb.org/browse/MDEV-15426
[14:09:01] <wikibugs>	 DBA, MediaWiki-Watchlist, Wikidata, Performance, and 3 others: re-enable Wikidata Recent Changes integration on Russian Wikipedia - https://phabricator.wikimedia.org/T179012#4005953 (WMDE-leszek) stalled>Open
[14:09:18] <wikibugs>	 DBA, Commons, Contributors-Team, MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#4005955 (WMDE-leszek)
[14:09:46] <wikibugs>	 DBA, Commons, MediaWiki-Watchlist, Wikidata, and 4 others: Re-enable Wikidata Recent Changes integration on Commons - https://phabricator.wikimedia.org/T179010#4005960 (WMDE-leszek) Open>stalled Now to be done after T179012
[14:10:02] <wikibugs>	 DBA, Commons, Contributors-Team, MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3667201 (WMDE-leszek)
[15:27:21] <wikibugs>	 DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4006356 (Marostegui)
[15:47:19] <wikibugs>	 DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4006405 (Marostegui)
[15:53:35] <wikibugs>	 DBA, Patch-For-Review: Prepare and indicate proper master db failover candidates for all eqiad database sections (s1-s8, x1) - https://phabricator.wikimedia.org/T186321#4006415 (Marostegui)
[18:35:50] <wikibugs>	 DBA, GeoData: Removal of {{#coordinates:}} leaves DB entries behind - https://phabricator.wikimedia.org/T143366#4007182 (Pnorman) Pulling off of the maps sprint, because it's not kartotherian, kartographer, or anything we're responsible for. I'm not sure who is responsible for GeoHack.
[19:38:32] <wikibugs>	 DBA, GeoData: Removal of {{#coordinates:}} leaves DB entries behind - https://phabricator.wikimedia.org/T143366#2566073 (jcrespo) If there is no one responsible for the #GeoData extension, we should start the sunsetting process.
[19:38:43] <wikibugs>	 DBA, Wikidata, Patch-For-Review: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#1709272 (Ladsgroup) After this patch goes live and we have waited for a week, we should stop writing to those columns and start dropping dead code and eventually...
[22:16:15] <wikibugs>	 DBA, Community-Tech, MediaWiki-extensions-GlobalPreferences, Patch-For-Review, Schema-change: DBA review for GlobalPreferences schema - https://phabricator.wikimedia.org/T184666#4008324 (Niharika)