[05:26:28] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3547688 (Marostegui) >>! In T173859#3545916, @Steinsplitter wrote: >>>! In T173859#3544614, @Marostegui wrote: >> @MarcoAurelio See: T172207#3544611 >>...
[05:26:46] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3547689 (Marostegui) stalled → Open
[05:31:42] DBA, Community-Tech, cloud-services-team, Security: create production ip_changes table for RangeContributions - https://phabricator.wikimedia.org/T173891#3547704 (Marostegui)
[05:37:39] DBA, Community-Tech, cloud-services-team, Security: create production ip_changes table for RangeContributions - https://phabricator.wikimedia.org/T173891#3547705 (Marostegui) >>! In T173891#3546800, @kaldari wrote: > Ran `foreachwiki sql.php /srv/mediawiki/php/maintenance/archives/patch-ip_change...
[06:01:25] DBA, Community-Tech, cloud-services-team, Security: create production ip_changes table for RangeContributions - https://phabricator.wikimedia.org/T173891#3547706 (Marostegui) >>! In T173891#3545400, @kaldari wrote: >>if possible try to avoid assigning it directly to Jaime > I assigned it to Jaime...
[06:02:19] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3547707 (Marostegui) >>! In T173365#3545014, @Cmjohnson wrote: > @Marostegui The ssd has been replaced. Please resolve after rebuild Should we close this ticket and create a new one f...
[06:07:36] DBA, Operations, ops-eqiad, Patch-For-Review: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3547710 (Marostegui) a: Cmjohnson This host is now ready to be decommissioned and ready for @Cmjohnson to do the DC-Ops part
[08:04:05] jynus: hey, this is ready for merge: https://gerrit.wikimedia.org/r/#/c/370626/7
[08:04:17] I was checking it out
[08:04:24] I have been monitoring everything and it's fine
[08:04:38] https://puppet-compiler.wmflabs.org/compiler02/7589/terbium.eqiad.wmnet/
[08:04:53] I am not convinced about using /tmp
[08:05:19] what do you suggest?
[08:05:44] a dir only www-data can write to
[08:06:36] /home/www-data ?
[08:08:32] also, the cron parameters are wrong, unless you want that to execute only on Sundays at 3:30
[08:08:49] okay. Let me fix that
[08:10:33] jynus: Are you sure? I checked the top cronjobs and they are being run every hour
[08:10:39] *above cronjobs
[08:12:52] I sent you https://puppet-compiler.wmflabs.org/compiler02/7589/terbium.eqiad.wmnet/
[08:13:39] others have '"minute": "*/3"'
[08:13:58] or minute: [0, 15, 30, 45]
[08:14:01] yours has
[08:14:18] "minute": "30", "hour": "3", "weekday": "0"
[08:14:39] check it for yourself on that link
[08:19:30] let's move the log to /var/log/wikidata
[08:19:35] let me try to learn that
[08:19:41] jynus: Already did that
[08:19:56] I'm trying to fix the timing and then upload the new patchset
[08:20:21] I will move the (assumed) canonical one on /tmp to there when you tell me
[08:21:02] did you check the issue with logging frequency?
[08:21:32] yeah, it will write once every three or four minutes
[08:21:35] I guess that's okay
[08:21:40] cool yes
[08:21:55] we can leave it running for more than an hour if that will be a problem
[08:22:09] will it skip rows if they are already done?
[08:22:49] so make the changes, and you can rebuild the puppet compiler job yourself
[08:23:22] at https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/7589/
[08:23:28] yeah it will skip them but report it (so it will write faster for the first 500K rows)
[08:23:30] (logged)
[08:23:52] oh, we can start on the 500Kth
[08:23:56] that is no problem
[08:24:22] the problem with tmp is that it can easily be overwritten
[08:24:45] so better with the other logs
[08:25:00] if there is any problem, we can remove or modify it
[08:25:26] but better than some other random process overwriting it
[08:27:25] jynus: agreed, but regarding the timing, I couldn't find the problem or the way to fix it, https://docs.puppet.com/puppet/latest/types/cron.html says omitted parameters default to "*"
[08:27:49] and couldn't find the cron in the puppet compiler
[08:28:12] maybe the behaviour is different in our version
[08:28:41] I would, out of caution, be explicit and put all of those as '*'
[08:28:58] okay
[08:29:08] maybe it is a compiler limitation, but better be sure - I remember having reverted several cron deployments
[08:29:24] because I am 100% sure the production behaviour is the same as the compilation
[08:29:48] so just hour => '*', weekday => '*' should be enough
[08:30:45] jynus: okay, uploaded the new patchset, please check :)
[08:31:07] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/7590/console
[08:31:39] and if you ask, yes, that is manual CI^
[08:31:47] nice
[08:32:30] jynus: once it's merged, is it possible to write the last lines of /tmp/rebuildTermSqlIndex.log into the file I specified in the puppet patch?
[08:32:42] sure
[08:32:49] It will make the script skip the first 500K items
[08:32:50] thanks
[08:34:16] mmmm, what will happen on rotation?
[08:35:34] nothing I guess, but the files are super super small
[08:35:37] it won't log much
[08:35:54] The file is not rotating
[08:36:57] maxage 180
[08:37:17] hopefully this will not take more than 180 days :-)
[08:38:38] btw. the jobqueue is still not happy: https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1&from=now-30d&to=now
[08:43:34] Amir1: so "cp /tmp/rebuildTermSqlIndex.log /var/log/wikidata/rebuildTermSqlIndex.log" ?
[08:43:43] yeah
[08:43:54] last one is 488002
[08:43:59] Q518873
[08:44:03] that's correct
[08:44:32] tail -n 1 /var/log/wikidata/rebuildTermSqlIndex.log -> Processed up to page 488516 (Q519411)
[08:44:54] nice
[08:44:56] Thanks!
[08:45:11] start monitoring that long when I tell you
[08:45:14] *log
[08:45:20] sure
[08:45:26] running puppet agent?
[08:45:35] I have to deploy it first
[08:45:44] kk
[08:47:26] Amir1: note it will not only be better for config changes, you will not need to track the state most of the time
[08:48:10] sure
[08:49:14] I can confirm it is on www-data's crontab # Puppet Name: wikibase-rebuildTermSqlIndex
[08:50:20] It hasn't started yet; I think it will start in 40 minutes
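For reference, a minimal Puppet sketch of the cron fix discussed above, using the core cron and file resource types. The resource name (wikibase-rebuildTermSqlIndex), the www-data user, the schedule fields and the log path come from the conversation; the exact command line and the log-directory resource are assumptions, not the actual content of Gerrit change 370626:

    # /var/log/wikidata instead of /tmp: only www-data can write here,
    # so no random process can overwrite the state log (ownership assumed)
    file { '/var/log/wikidata':
        ensure => directory,
        owner  => 'www-data',
    }

    cron { 'wikibase-rebuildTermSqlIndex':
        ensure  => present,
        user    => 'www-data',
        # the first patchset compiled to minute => '30', hour => '3',
        # weekday => '0', i.e. Sundays at 03:30 only; omitted fields do
        # default to '*', but being explicit makes the hourly intent clear
        minute  => '30',
        hour    => '*',
        weekday => '*',
        # hypothetical invocation; output is appended to the agreed log file
        command => '/usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki >> /var/log/wikidata/rebuildTermSqlIndex.log 2>&1',
    }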
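Likewise, a sketch of the rotation rule implied by the "maxage 180" remark above: logrotate's maxage directive deletes rotated copies older than 180 days. Only the maxage value is from the chat; the file path and the remaining directives are assumptions:

    file { '/etc/logrotate.d/wikidata':
        ensure  => file,
        # weekly/missingok/notifempty are assumed defaults; maxage 180
        # removes rotated logs older than 180 days
        content => "/var/log/wikidata/*.log {\n    weekly\n    missingok\n    notifempty\n    maxage 180\n}\n",
    }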
[08:50:56] maybe I can do a test run?
[08:51:06] with a lower timeout?
[08:51:35] yeah sure
[08:55:05] running now
[08:55:34] works fine
[08:55:41] reported a new row
[08:55:59] cool
[08:56:03] I will let it finish
[08:56:10] to see if something strange happens on kill
[08:57:04] DBA, Epic, Tracking: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking) - https://phabricator.wikimedia.org/T54921#3547874 (Marostegui)
[08:58:47] Amir1: what do you think of this: I will prepare a temporary stop of the script, so in case something goes wrong, you can tell any ops easily
[08:59:15] yeah sure
[08:59:23] hopefully not needed
[09:13:47] DBA, Wikidata, Patch-For-Review, User-Ladsgroup, Wikidata-Sprint: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460#3465512 (jcrespo) So this is deployed into production, we did a test run and it seems to work as intended. I left a "disable" patch h...
[10:02:29] jynus: are you joining the meeting?
[11:04:49] jynus, marostegui: a quick curiosity about the 30m restart time on new hardware, did mariadb improve what I reported in https://jira.mariadb.org/browse/MDEV-9930 ?
[11:06:03] I can show you on graphs
[11:07:38] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1078&var-port=9104&from=1502901364771&to=1502922681478
[11:07:41] vs
[11:08:38] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=1503378415766&to=1503572897594&var-dc=codfw%20prometheus%2Fops&var-server=dbstore2001&var-port=13315
[11:09:01] the new host can be pooled after 30 minutes, even if it takes 6 hours to be fully ok
[11:09:23] dbstores have problems catching up after restart
[11:09:46] ok, fair enough! thanks
[11:10:39] the only problem I see is that if we restart again before the ~6h it takes to re-populate the full buffer pool, we'll get a smaller buffer pool dump to reload
[11:11:13] not very important on backup hosts that are not serving traffic anyway :)
[11:30:16] DBA, Operations, Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3548140 (Steinsplitter) Open → Resolved done: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Karl-Heinz_Jansen Thanks @Marostegui
[11:33:54] DBA, MediaWiki-Parser, MediaWiki-Platform-Team, Patch-For-Review: WMF ParserCache disk space exhaustion - https://phabricator.wikimedia.org/T167784#3548145 (jcrespo) Open → Resolved a: jcrespo > To keep defragmenting on a regular basis? Yes that is a horrible thing to do, but seeing reg...
[11:42:00] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3548155 (jcrespo)
[11:43:26] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3548157 (Marostegui)
[11:59:44] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3505388 (ops-monitoring-bot) Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db1096.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reim...
[12:21:25] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3548219 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1096.eqiad.wmnet'] ``` and were **ALL** successful.
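The 30-minute repooling discussed above relies on MariaDB dumping the buffer pool to disk on shutdown and reloading it in the background on startup, so a host can take traffic while the cache warms up over ~6 hours; restarting before the reload finishes leaves a smaller dump for the next start. A sketch of the relevant my.cnf fragment follows; managing it via a Puppet file resource at this path is an assumption (the real WMF config is built from its own templates), but both variables are standard InnoDB settings:

    file { '/etc/mysql/conf.d/buffer-pool-reload.cnf':
        ensure  => file,
        # dump the buffer pool on shutdown, reload it (in the background)
        # on startup, so the host can be pooled before the cache is warm
        content => "[mysqld]\ninnodb_buffer_pool_dump_at_shutdown = 1\ninnodb_buffer_pool_load_at_startup  = 1\n",
    }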
[12:55:06] DBA, MediaWiki-Watchlist, Wikimedia-General-or-Unknown, Wikimedia-log-errors: User with 40000 entries in their Watchlist cannot access it on Commons anymore: Database error - https://phabricator.wikimedia.org/T171898#3548310 (Aklapper) Merging into {T171027} as it seems to be the same underlying...
[12:55:35] DBA, MediaWiki-Watchlist, Wikimedia-General-or-Unknown, Wikimedia-log-errors: User with 40000 entries in their Watchlist cannot access it on Commons anymore: Database error - https://phabricator.wikimedia.org/T171898#3548312 (Aklapper)
[15:19:47] Blocked-on-schema-change, MediaWiki-Platform-Team, Structured-Data-Commons, Wikidata: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044#3549134 (daniel)
[16:16:43] Blocked-on-schema-change, MediaWiki-Platform-Team, Structured-Data-Commons, Wikidata: Deploy MCR storage layer - https://phabricator.wikimedia.org/T174044#3549380 (daniel)
[16:19:28] DBA, Operations, Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3549410 (jcrespo) It is my intention to reimage db1069, provisioning it from db1033 (s7) and pool it as a db1028 replacement, making both db1033 and db1028 obsolete (to be retire...
[16:42:29] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549499 (jcrespo) Open → Resolved a: Cmjohnson > Should we close this ticket and create a new one for testing another host and see its behaviour? Let's just do it.
[16:47:02] DBA: Test reliability of RAID configuration/database hosts on single disk failure - https://phabricator.wikimedia.org/T174054#3549516 (jcrespo)
[16:47:16] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549532 (Cmjohnson) let me know which db you want to test and when
[16:47:18] DBA: Test reliability of RAID configuration/database hosts on single disk failure - https://phabricator.wikimedia.org/T174054#3549533 (jcrespo) p: Triage → Normal
[16:47:57] DBA: Test reliability of RAID configuration/database hosts on single disk failure - https://phabricator.wikimedia.org/T174054#3549535 (Marostegui) This is what I commented on the other ticket: ``` I would like to propose db1076 (s2) as a candidate host to do the test once db1078 is back in the pool with the...
[16:48:56] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3549544 (jcrespo) Let's take some time to find a good candidate and create some fake load, and we will ping either you or Papaul on T174054.
[16:52:47] DBA, Patch-For-Review: Productionize 11 new eqiad database servers - https://phabricator.wikimedia.org/T172679#3549555 (Marostegui)
[19:13:43] DBA, Operations: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3550282 (jcrespo)
[19:26:08] DBA, Operations, Patch-For-Review: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3550329 (jcrespo)
[19:27:56] DBA, Operations, ops-eqiad, Patch-For-Review: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3550377 (Cmjohnson) Return shipping info for disk UPS 1ZW0948Y9082750467