Donate to e Foundation | Murena handsets with /e/OS | Own a part of Murena! Learn more

Commit 59927fb9 authored by Hugh Dickins's avatar Hugh Dickins Committed by Linus Torvalds
Browse files

memcg: free mem_cgroup by RCU to fix oops



After fixing the GPF in mem_cgroup_lru_del_list(), three times one
machine running a similar load (moving and removing memcgs while
swapping) has oopsed in mem_cgroup_zone_nr_lru_pages(), when retrieving
memcg zone numbers for get_scan_count() for shrink_mem_cgroup_zone():
this is where a struct mem_cgroup is first accessed after being chosen
by mem_cgroup_iter().

Just what protects a struct mem_cgroup from being freed, in between
mem_cgroup_iter()'s css_get_next() and its css_tryget()? css_tryget()
fails once css->refcnt is zero with CSS_REMOVED set in flags, yes: but
what if that memory is freed and reused for something else, which sets
"refcnt" non-zero? Hmm, and scope for an indefinite freeze if refcnt is
left at zero but flags are cleared.

It's tempting to move the css_tryget() into css_get_next(), to make it
really "get" the css, but I don't think that actually solves anything:
the same difficulty in moving from css_id found to stable css remains.

But we already have rcu_read_lock() around the two, so it's easily fixed
if __mem_cgroup_free() just uses kfree_rcu() to free mem_cgroup.

However, a big struct mem_cgroup is allocated with vzalloc() instead of
kzalloc(), and we're not allowed to vfree() at interrupt time: there
doesn't appear to be a general vfree_rcu() to help with this, so roll
our own using schedule_work().  The compiler decently removes
vfree_work() and vfree_rcu() when the config doesn't need them.

Signed-off-by: default avatarHugh Dickins <hughd@google.com>
Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
parent f1cbd03f
Loading
Loading
Loading
Loading
+47 −6
Original line number Diff line number Diff line
@@ -230,10 +230,30 @@ struct mem_cgroup {
	 * the counter to account for memory usage
	 */
	struct res_counter res;

	union {
		/*
		 * the counter to account for mem+swap usage.
		 */
		struct res_counter memsw;

		/*
		 * rcu_freeing is used only when freeing struct mem_cgroup,
		 * so put it into a union to avoid wasting more memory.
		 * It must be disjoint from the css field.  It could be
		 * in a union with the res field, but res plays a much
		 * larger part in mem_cgroup life than memsw, and might
		 * be of interest, even at time of free, when debugging.
		 * So share rcu_head with the less interesting memsw.
		 */
		struct rcu_head rcu_freeing;
		/*
		 * But when using vfree(), that cannot be done at
		 * interrupt time, so we must then queue the work.
		 */
		struct work_struct work_freeing;
	};

	/*
	 * Per cgroup active and inactive list, similar to the
	 * per zone LRU lists.
@@ -4779,6 +4799,27 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
	return NULL;
}

/*
 * Helpers for freeing a vzalloc()ed mem_cgroup by RCU,
 * but in process context.  The work_freeing structure is overlaid
 * on the rcu_freeing structure, which itself is overlaid on memsw.
 */
static void vfree_work(struct work_struct *work)
{
	struct mem_cgroup *memcg;

	memcg = container_of(work, struct mem_cgroup, work_freeing);
	vfree(memcg);
}
static void vfree_rcu(struct rcu_head *rcu_head)
{
	struct mem_cgroup *memcg;

	memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
	INIT_WORK(&memcg->work_freeing, vfree_work);
	schedule_work(&memcg->work_freeing);
}

/*
 * At destroying mem_cgroup, references from swap_cgroup can remain.
 * (scanning all at force_empty is too costly...)
@@ -4802,9 +4843,9 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)

	free_percpu(memcg->stat);
	if (sizeof(struct mem_cgroup) < PAGE_SIZE)
		kfree(memcg);
		kfree_rcu(memcg, rcu_freeing);
	else
		vfree(memcg);
		call_rcu(&memcg->rcu_freeing, vfree_rcu);
}

static void mem_cgroup_get(struct mem_cgroup *memcg)