14:30:16 <ttereshc> #startmeeting Pulp Triage 2017-08-18
14:30:16 <ttereshc> !start
14:30:16 <ttereshc> #info ttereshc has joined triage
14:30:16 <pulpbot> Meeting started Fri Aug 18 14:30:16 2017 UTC and is due to finish in 60 minutes.  The chair is ttereshc. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:30:16 <pulpbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:30:16 <pulpbot> The meeting name has been set to 'pulp_triage_2017_08_18'
14:30:16 <pulpbot> ttereshc: ttereshc has joined triage
14:30:21 <daviddavis> !here
14:30:21 <daviddavis> #info daviddavis has joined triage
14:30:21 <pulpbot> daviddavis: daviddavis has joined triage
14:30:24 <mhrivnak> elijah_d ok we can chat after triage.
14:30:26 <mhrivnak> #info mhrivnak has joined triage
14:30:26 <mhrivnak> !here
14:30:26 <pulpbot> mhrivnak: mhrivnak has joined triage
14:30:36 <bizhang> !here
14:30:36 <bizhang> #info bizhang has joined triage
14:30:37 <pulpbot> bizhang: bizhang has joined triage
14:30:38 <ttereshc> !next
14:30:39 <bmbouter> !here
14:30:40 <ttereshc> #topic Unable to sync docker repo because worker dies - http://pulp.plan.io/issues/2966
14:30:40 <bmbouter> #info bmbouter has joined triage
14:30:40 <pulpbot> ttereshc: 3 issues left to triage: 2966, 2979, 2985
14:30:41 <pulpbot> Issue #2966 [NEW] (unassigned) - Priority: Normal | Severity: High
14:30:42 <pulpbot> Unable to sync docker repo because worker dies - http://pulp.plan.io/issues/2966
14:30:43 <pulpbot> bmbouter: bmbouter has joined triage
14:30:44 <elijah_d> mhrivnak, ok
14:30:49 <dkliban> !here
14:30:49 <dkliban> #info dkliban has joined triage
14:30:49 <pulpbot> dkliban: dkliban has joined triage
14:31:00 <dalley> !here
14:31:00 <dalley> #info dalley has joined triage
14:31:00 <pulpbot> dalley: dalley has joined triage
14:31:08 <ttereshc> mhrivnak, were you talking about this issue with elijah_d?
14:31:30 <ttereshc> yeah, I see, so skip it for now?
14:31:38 <mhrivnak> Yes. I just want to try digging in a bit more to see if we can figure out where sigkill is coming from.
14:31:48 <ttereshc> !propose skip
14:31:48 <ttereshc> #idea Proposed for #2966: Skip this issue for this triage session.
14:31:48 <pulpbot> ttereshc: Proposed for #2966: Skip this issue for this triage session.
14:31:55 <bmbouter> try using strace on the pid receiving the sigkill
14:31:58 <mhrivnak> Yeah, I think that's fine. We'll look at it today.
14:32:11 <ttereshc> !accept
14:32:11 <ttereshc> #agreed Skip this issue for this triage session.
14:32:11 <pulpbot> ttereshc: Current proposal accepted: Skip this issue for this triage session.
14:32:12 <pulpbot> ttereshc: 2 issues left to triage: 2979, 2985
14:32:12 <mhrivnak> I don't think strace will work, because sigkill doesn't go to the process.
14:32:13 <ttereshc> #topic Celery workers may deadlock when PULP_MAX_TASKS_PER_CHILD and mongo replica set are used - http://pulp.plan.io/issues/2979
14:32:13 <pulpbot> Issue #2979 [NEW] (unassigned) - Priority: Normal | Severity: High
14:32:14 <pulpbot> Celery workers may deadlock when PULP_MAX_TASKS_PER_CHILD and mongo replica set are used - http://pulp.plan.io/issues/2979
14:32:19 <ipanova> !here
14:32:19 <ipanova> #info ipanova has joined triage
14:32:19 <pulpbot> ipanova: ipanova has joined triage
14:32:21 <mhrivnak> But there are some other kernel audit tricks that might help. :)
14:32:30 <bmbouter> oh yeah sigkill goes to init
14:32:38 <bmbouter> so I commented on this one
14:32:58 <bmbouter> one thing I did not put in that comment: both katello and sat will be affected by this, b/c they both use that MAX_TASKS... option
14:33:39 <ipanova> mhrivnak: i was trying to track down the sigkill with audit and so far no luck
14:33:40 <bmbouter> fixing this will be a significant change to how our celery stuff runs; I think the only way is to have celery stop forking
14:34:00 <mhrivnak> ipanova ok, let's compare notes in a bit.
14:34:20 <bmbouter> and we need one more piece of info, which is: is the postgresql client driver single-threaded or not?
14:34:30 <bmbouter> I'm not sure, but if it is then not only will pulp2 have this problem but also pulp3
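The deadlock bmbouter describes is, presumably, the classic fork-after-threading hazard: a child forked while some other thread holds a lock inherits the lock in its locked state but not the thread that would release it. A minimal stdlib-only sketch of that hazard (not Pulp code, POSIX only; pymongo's replica-set monitor threads and celery's re-forking under PULP_MAX_TASKS_PER_CHILD would play the roles shown here):

```python
import os
import threading
import time

lock = threading.Lock()

def background_worker():
    # Stands in for a driver's monitor thread that briefly takes internal locks.
    while True:
        with lock:
            time.sleep(0.001)

threading.Thread(target=background_worker, daemon=True).start()
time.sleep(0.1)  # give the worker time to start cycling the lock

pid = os.fork()  # POSIX only
if pid == 0:
    # Child: if the fork landed while the worker held the lock, nothing in
    # this process will ever release it.
    if lock.acquire(timeout=2):
        print("child: got the lock (fork landed at a safe moment)")
    else:
        print("child: still waiting after 2s -- this is the deadlock")
    os._exit(0)
else:
    os.waitpid(pid, 0)
```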
14:34:49 <daviddavis> 349766
14:34:53 <daviddavis> oops
14:34:55 <mhrivnak> Is there a recommendation we can make for being able to identify when one of them is deadlocked?
14:35:09 <bmbouter> there is not unfortunately
14:35:19 <bmbouter> perhaps they could count the threads
14:35:49 <bmbouter> but that could also be unreliable since the thread count changes during operation
14:36:10 <bmbouter> really you need a core dump to look at specifically how many pymongo threads are in existence
14:36:29 <mhrivnak> But if it makes it far enough past the fork to be spawning extra threads, has it effectively dodged the bullet?
14:37:09 <mhrivnak> Anyway, maybe that's getting into the weeds.
14:37:13 <ichimonji10> elijah_d++
14:37:14 <pulpbot> ichimonji10: elijah_d's karma is now 10
14:37:22 <bmbouter> it has, but the problem is that before it makes it that far, you can't know if it will make it that far or if it's deadlocked already
14:37:23 <elijah_d> ichimonji10, thanks!
14:37:29 <mhrivnak> If there's a way to help people identify this in the meantime, that could mitigate the severity.
14:37:38 <bmbouter> there isn't a reliable way
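For what the "count the threads" idea would look like in practice, a rough Linux-only sketch (the PID is hypothetical, and with bmbouter's caveat that the count fluctuates during normal operation, so it is a hint rather than a deadlock detector):

```python
def thread_count(pid: int) -> int:
    # Read the kernel's view of the process's thread count from /proc.
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError(f"no Threads: line found for pid {pid}")

# e.g. thread_count(12345) for some (hypothetical) celery worker PID
```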
14:38:52 <mhrivnak> So options seem to be 1) make celery stop forking, 2) figure out how to delay database access during startup until post-fork, 3) wait for pulp 3?
14:39:18 <mhrivnak> not suggesting any of those are good or easy of course. :)
14:39:26 <bmbouter> yeah but they are clear options
14:39:39 <bmbouter> so I don't want to take this on the sprint, but I think we kind of need to
14:39:47 <bmbouter> at least to understand if pulp3 will be affected or not
14:40:10 <bmbouter> option (2) can't be done without rearchitecting our crash-recover scenarios
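For scale, a hedged sketch of the connection-level shape option (2) usually takes with celery: create the client only after each child forks, via celery's worker_process_init signal. The module layout and URI below are illustrative, not Pulp's actual code, and this does not address the crash-recovery rearchitecting bmbouter mentions.

```python
from celery.signals import worker_process_init
from pymongo import MongoClient

client = None  # intentionally not created at import (pre-fork) time

@worker_process_init.connect
def create_client_after_fork(**kwargs):
    # Runs inside each prefork child right after it starts, so the client's
    # monitor threads (and their locks) belong to this process, not the parent.
    global client
    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
```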
14:40:18 <mhrivnak> I suggest we accept it but not take it on this sprint.
14:40:34 <mhrivnak> I think we can get to the other side of the plugin api work and then look at this.
14:40:47 <bmbouter> that is also what I want, but consider this
14:41:02 <bmbouter> katello and sat are both enabling the MAX_TASKS_.... option
14:41:19 <bmbouter> so they will experience rare deadlocking by doing that
14:41:41 <bmbouter> that is the only thing that makes me think we should do more (even though I want to focus on the plugin API for pulp3)
14:41:42 <daviddavis> ugg
14:42:36 <mhrivnak> Gotcha. If you want to pursue this on this sprint, that's fine with me.
14:42:55 <ttereshc> so it looks like it was enabled in 6.2.7
14:43:00 <mhrivnak> It seems like any solution is likely to have a long timeline, but it doesn't hurt to start quickly.
14:43:14 <ttereshc> it's been a while and no reports so far
14:43:23 <bmbouter> whatever you all decide is fine w/ me
14:43:28 <ttereshc> (I mean maybe we still can wait till the next sprint)
14:43:31 <bmbouter> I wanted to just provide the scope of impact, etc
14:43:47 <bmbouter> also I can advise about how to fix (stop forking) but I don't plan to take it on as the assignee
14:43:50 <bmbouter> either way
14:44:04 <ttereshc> !propose accept and add to sprint
14:44:04 <pulpbot> ttereshc: propose accept Propose accepting the current issue in its current state.
14:44:13 <ttereshc> !propose other accept and add to sprint
14:44:13 <ttereshc> #idea Proposed for #2979: accept and add to sprint
14:44:13 <pulpbot> ttereshc: Proposed for #2979: accept and add to sprint
14:44:34 <mhrivnak> +1
14:45:03 <ttereshc> !accept
14:45:03 <ttereshc> #agreed accept and add to sprint
14:45:03 <pulpbot> ttereshc: Current proposal accepted: accept and add to sprint
14:45:05 <ttereshc> #topic I can create importers/publishers for any repo while targeting a specific repo URL - http://pulp.plan.io/issues/2985
14:45:05 <pulpbot> ttereshc: 1 issues left to triage: 2985
14:45:06 <pulpbot> Issue #2985 [NEW] (unassigned) - Priority: Normal | Severity: Medium
14:45:07 <pulpbot> I can create importers/publishers for any repo while targeting a specific repo URL - http://pulp.plan.io/issues/2985
14:46:44 <mhrivnak> This one definitely needs fixing.
14:47:00 <ipanova> let's add it to the sprint?
14:47:05 <ttereshc> I guess this issue is valid for any nested endpoints
14:47:09 <dkliban> yeah
14:47:10 <mhrivnak> That works for me.
14:47:15 <ttereshc> not only importers
14:47:16 <daviddavis> +1
14:47:18 <dkliban> +1
14:47:22 <ipanova> +1
14:47:42 <daviddavis> accept, add to sprint, and comment about checking other nested urls
14:47:56 <ttereshc> !propose other  accept, add to sprint, and comment about checking other nested url
14:47:56 <ttereshc> #idea Proposed for #2985: accept, add to sprint, and comment about checking other nested url
14:47:57 <pulpbot> ttereshc: Proposed for #2985: accept, add to sprint, and comment about checking other nested url
14:48:03 <ttereshc> !accept
14:48:03 <ttereshc> #agreed accept, add to sprint, and comment about checking other nested url
14:48:03 <pulpbot> ttereshc: Current proposal accepted: accept, add to sprint, and comment about checking other nested url
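A framework-agnostic sketch of the agreed check, using made-up names (validate_nested_create, repo_id) rather than Pulp's real API: a create against a nested URL such as /repositories/<repo_id>/importers/ gets bound to, or validated against, the repo named in the URL, and the same applies to the other nested endpoints ttereshc points out.

```python
def validate_nested_create(url_repo_id: str, body: dict) -> dict:
    """Reject or rebind payloads that target a repo other than the one in the URL."""
    body_repo_id = body.get("repo_id")
    if body_repo_id is not None and body_repo_id != url_repo_id:
        raise ValueError(
            f"payload targets repo {body_repo_id!r} but the URL targets {url_repo_id!r}"
        )
    # Bind the new resource to the repo addressed by the URL.
    return {**body, "repo_id": url_repo_id}

# validate_nested_create("zoo", {"repo_id": "other"})            # -> ValueError
# validate_nested_create("zoo", {"feed": "http://example.com/"})  # -> bound to "zoo"
```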
14:48:04 <pulpbot> ttereshc: No issues to triage.
14:48:08 <ttereshc> !end
14:48:08 <ttereshc> #endmeeting