- freeing the page
there are multiple high-level page-free functions; free_pages_bulk()
is the lowest-level function that does the real free.
-
+
When the memory subsystem runs low on LRU pages, pages are reclaimed by
- moving pages from active list to inactive list (refill_inactive_zone())
- - freeing pages from the inactive list (shrink_zone)
+ - freeing pages from the inactive list (shrink_zone)
depending on the recent usage of the page (approximately).
+In the process of its life cycle, a page can move from the LRU list to swap
+and back. For this document's purposes, we treat these moves the same as
+freeing and allocating the page, respectively.
+
1. Introduction
---------------
Memory resource controller controls the number of LRU physical pages.
Note that the numbers specified in the shares file don't
directly correspond to the number of pages. But, the user can make
it so by making the total_guarantee and max_limit of the default class
-(/rcfs/taskclass) to be the total number of pages(given in config file)
+(/rcfs/taskclass) to be the total number of pages(given in stats file)
available in the system.
for example:
# cd /rcfs/taskclass
- # cat config
- res=mem;tot_pages=239778,active=60473,inactive=135285,free=44555
+ # grep System stats
+ System: tot_pages=257512,active=5897,inactive=2931,free=243991
# cat shares
res=mem,guarantee=-2,limit=-2,total_guarantee=100,max_limit=100
- "tot_pages=239778" above mean there are 239778 lru pages in
+  "tot_pages=257512" above means there are 257512 lru pages in
the system.
By making total_guarantee and max_limit to be same as this number at
this level (/rcfs/taskclass), one can make guarantee and limit in all
classes refer to the number of pages.
- # echo 'res=mem,total_guarantee=239778,max_limit=239778' > shares
+ # echo 'res=mem,total_guarantee=257512,max_limit=257512' > shares
# cat shares
- res=mem,guarantee=-2,limit=-2,total_guarantee=239778,max_limit=239778
+ res=mem,guarantee=-2,limit=-2,total_guarantee=257512,max_limit=257512
The number of pages a class has allocated may be anywhere between its
guarantee and limit; victim pages will be chosen from classes that are
above their guarantee.
-Pages will be freed from classes that are close to their "limit" before
-freeing pages from the classes that are close to their guarantee. Pages
-belonging to classes that are below their guarantee will not be chosen as
-a victim.
+The victim class will be chosen by the number of pages a class is using over
+its guarantee, i.e. a class that is using 10000 pages over its guarantee will
+be chosen over a class that is using 1000 pages over its guarantee.
+Pages belonging to classes that are below their guarantee will not be
+chosen as victims.
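The selection rule above can be sketched as a small user-space model (struct class_usage and pick_victim are illustrative names, not part of this patch):

```c
#include <assert.h>

/* Illustrative model: the victim is the class using the most pages
 * over its guarantee; classes at or below guarantee are never picked. */
struct class_usage {
	int used;		/* pages currently charged to the class */
	int guarantee;		/* pages guaranteed to the class */
};

/* Return the index of the victim class, or -1 if no class is over
 * its guarantee. */
static int pick_victim(const struct class_usage *cls, int n)
{
	int i, victim = -1, worst_over = 0;

	for (i = 0; i < n; i++) {
		int over = cls[i].used - cls[i].guarantee;

		if (over > worst_over) {
			worst_over = over;
			victim = i;
		}
	}
	return victim;
}
```

With one class 1000 pages over and another 10000 pages over, the latter is picked first, matching the text above.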
+
+2. Configuration parameters
+---------------------------
+
+Memory controller provides the following configuration parameters. Usage of
+these parameters will be made clear in the following section.
+
+fail_over: When pages are being allocated, if the class is over fail_over % of
+ its limit, then fail the memory allocation. Default is 110.
+ ex: If limit of a class is 30000 and fail_over is 110, then memory
+ allocations would start failing once the class is using more than 33000
+ pages.
-2. Core Design
+shrink_at: When a class is using shrink_at % of its limit, then start
+  shrinking the class, i.e. start freeing pages to make more free pages
+  available for this class. Default is 90.
+  ex: If limit of a class is 30000 and shrink_at is 90, then pages from this
+  class will start to get freed when the class's usage goes above 27000.
+
+shrink_to: When a class reaches shrink_at % of its limit, ckrm will try to
+  shrink the class's usage to shrink_to %. Default is 80.
+ ex: If limit of a class is 30000 with shrink_at being 90 and shrink_to
+ being 80, then ckrm will try to free pages from the class when its
+ usage reaches 27000 and will try to bring it down to 24000.
+
+num_shrinks: Number of shrink attempts ckrm will do within shrink_interval
+ seconds. After this many attempts in a period, ckrm will not attempt a
+ shrink even if the class's usage goes over shrink_at %. Default is 10.
+
+shrink_interval: Number of seconds in a shrink period. Default is 10.
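For a class whose limit is 30000 pages, the defaults above work out as follows (a sketch of the percentage arithmetic only; pages_at_pct is an illustrative helper, not a function from this patch):

```c
#include <assert.h>

/* Translate a percentage of a class's page limit into a page count,
 * the way fail_over/shrink_at/shrink_to are applied. */
static int pages_at_pct(int limit, int pct)
{
	return (limit * pct) / 100;
}
```

So with limit 30000: allocations fail past pages_at_pct(30000, 110) = 33000, shrinking starts at 27000 and aims for 24000.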
+
+3. Design
--------------------------
CKRM memory resource controller taps into appropriate low-level memory
management functions to associate a page with a class and to charge
the class that brings the page to the LRU list.
-2.1 Changes in page allocation function(__alloc_pages())
+CKRM maintains per-class lru lists instead of keeping them system-wide, so
+that reducing a class's usage doesn't involve going through the system-wide
+lru lists.
+
+3.1 Changes in page allocation function(__alloc_pages())
--------------------------------------------------------
-- If the class that the current task belong to is over 110% of its 'limit',
- allocation of page(s) fail.
-- After succesful allocation of a page, the page is attached with the class
- to which the current task belongs to.
+- If the class that the current task belongs to is over 'fail_over' % of its
+ 'limit', allocation of page(s) fail. Otherwise, the page allocation will
+ proceed as before.
- Note that the class is _not_ charged for the page(s) here.
-2.2 Changes in page free(free_pages_bulk())
+3.2 Changes in page free(free_pages_bulk())
-------------------------------------------
-- page is freed from the class it belongs to.
+- If the page still belongs to a class, the class will be credited for this
+ page.
-2.3 Adding/Deleting page to active/inactive list
+3.3 Adding/Deleting page to active/inactive list
-------------------------------------------------
When a page is added to the active or inactive list, the class that the
-page belongs to is charged for the page usage.
+task belongs to is charged for the page usage.
When a page is deleted from the active or inactive list, the class that the
page belongs to is credited back.
-If a class uses upto its limit, attempt is made to shrink the class's usage
-to 90% of its limit, in order to help the class stay within its limit.
+If a class's usage reaches 'shrink_at' % of its limit, an attempt is made
+to shrink
+the class's usage to 'shrink_to' % of its limit, in order to help the class
+stay within its limit.
But, if the class is aggressive and keeps going over its limit
-often(more than 10 shrink events in 10 seconds), then the memory resource
-controller gives up on the class and doesn't try to shrink the class, which
-will eventually lead the class to reach its 110% of its limit and then the
-page allocations will start failing.
+often (more than 'num_shrinks' such events in 'shrink_interval' seconds),
+then the memory resource controller gives up on the class and doesn't try
+to shrink the class, which will eventually lead the class to reach
+'fail_over' % of its limit, and then page allocations will start failing.
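The give-up behavior can be sketched as a simple rate limiter (a user-space model under assumed semantics; the struct layout and names are illustrative):

```c
#include <assert.h>

/* Allow at most num_shrinks shrink attempts per shrink_interval
 * seconds; once exceeded, report 0 so the caller stops shrinking
 * and the class drifts toward fail_over %, where allocations fail. */
struct shrink_state {
	long window_start;	/* start of the current period (seconds) */
	int count;		/* shrink attempts in the current period */
};

static int may_shrink(struct shrink_state *s, long now,
		      int num_shrinks, int shrink_interval)
{
	if (now - s->window_start >= shrink_interval) {
		s->window_start = now;	/* a new period begins */
		s->count = 0;
	}
	if (s->count >= num_shrinks)
		return 0;		/* aggressive class: give up */
	s->count++;
	return 1;
}
```

With the defaults (10 attempts per 10 seconds), the 11th request within one period is refused; a new period resets the counter.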
-2.4 Chages in the page reclaimation path (refill_inactive_zone and shrink_zone)
+3.4 Changes in the page reclamation path (refill_inactive_zone and shrink_zone)
-------------------------------------------------------------------------------
Pages will be moved from active to inactive list (refill_inactive_zone) and
-pages from inactive list will be freed in the following order:
-(range is calculated by subtracting 'guarantee' from 'limit')
- - Classes that are over 110% of their range
- - Classes that are over 100% of their range
- - Classes that are over 75% of their range
- - Classes that are over 50% of their range
- - Classes that are over 25% of their range
- - Classes whose parent is over 110% of its range
- - Classes that are over their guarantee
-
-2.5 Handling of Shared pages
+pages from the inactive list will be freed (shrink_zone) by choosing victim
+classes. Victim classes are chosen depending on their usage over their
+guarantee.
+
+Classes with a DONT_CARE guarantee are assigned an implicit guarantee, which
+is based on the number of children (with DONT_CARE guarantees) their parent
+has (including the default class) and the unused pages the parent still has.
+ex1: If the root class /rcfs/taskclass has 3 children c1, c2 and c3
+and has 200000 pages, and all the classes have DONT_CARE guarantees, then
+all the classes (c1, c2, c3 and the default class of /rcfs/taskclass) will
+get 50000 (200000 / 4) pages each.
+ex2: If, in the above example c1 is set with a guarantee of 80000 pages,
+then the other classes (c2, c3 and the default class of /rcfs/taskclass)
+will get 40000 ((200000 - 80000) / 3) pages each.
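The arithmetic in ex1 and ex2 can be modeled as below (a sketch; implicit_guarantee is an illustrative name, and nr_dontcare counts the DONT_CARE children plus the parent's default class):

```c
#include <assert.h>

/* Pages not claimed by explicit guarantees are split evenly among
 * the DONT_CARE classes (children plus the default class). */
static int implicit_guarantee(int parent_pages, int explicit_total,
			      int nr_dontcare)
{
	return (parent_pages - explicit_total) / nr_dontcare;
}
```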
+
+3.5 Handling of Shared pages
----------------------------
Even if a mm is shared by tasks, the pages that belong to the mm will be
charged against the individual tasks that bring the page into LRU.
executed in the default class (/rcfs/taskclass).
Initially, the systemwide default class gets 100% of the LRU pages, and the
-config file displays the total number of physical pages.
+stats file at the /rcfs/taskclass level displays the total number of
+physical pages.
# cd /rcfs/taskclass
- # cat config
- res=mem;tot_pages=239778,active=60473,inactive=135285,free=44555
+ # grep System stats
+ System: tot_pages=239778,active=60473,inactive=135285,free=44555
# cat shares
res=mem,guarantee=-2,limit=-2,total_guarantee=100,max_limit=100
tot_pages - total number of pages
active - number of pages in the active list ( sum of all zones)
- inactive - number of pages in the inactive list ( sum of all zones )
- free - number of free pages (sum of all pages)
+	inactive - number of pages in the inactive list (sum of all zones)
+	free - number of free pages (sum of all zones)
- By making total_guarantee and max_limit to be same as tot_pages, one make
+ By making total_guarantee and max_limit to be same as tot_pages, one can
make the numbers in shares file be same as the number of pages for a
class.
# cat shares
res=mem,guarantee=-2,limit=-2,total_guarantee=239778,max_limit=239778
+Changing configuration parameters:
+----------------------------------
+For a description of the parameters, read the file mem_rc.design in this same directory.
+
+Following are the default values for the configuration parameters:
+
+ localhost:~ # cd /rcfs/taskclass
+ localhost:/rcfs/taskclass # cat config
+ res=mem,fail_over=110,shrink_at=90,shrink_to=80,num_shrinks=10,shrink_interval=10
+
+Here is how to change a specific configuration parameter. Note that more than
+one configuration parameter can be changed in a single echo command, though
+for simplicity we show one per echo.
+
+ex: Changing fail_over:
+ localhost:/rcfs/taskclass # echo "res=mem,fail_over=120" > config
+ localhost:/rcfs/taskclass # cat config
+ res=mem,fail_over=120,shrink_at=90,shrink_to=80,num_shrinks=10,shrink_interval=10
+
+ex: Changing shrink_at:
+ localhost:/rcfs/taskclass # echo "res=mem,shrink_at=85" > config
+ localhost:/rcfs/taskclass # cat config
+ res=mem,fail_over=120,shrink_at=85,shrink_to=80,num_shrinks=10,shrink_interval=10
+
+ex: Changing shrink_to:
+ localhost:/rcfs/taskclass # echo "res=mem,shrink_to=75" > config
+ localhost:/rcfs/taskclass # cat config
+ res=mem,fail_over=120,shrink_at=85,shrink_to=75,num_shrinks=10,shrink_interval=10
+
+ex: Changing num_shrinks:
+ localhost:/rcfs/taskclass # echo "res=mem,num_shrinks=20" > config
+ localhost:/rcfs/taskclass # cat config
+ res=mem,fail_over=120,shrink_at=85,shrink_to=75,num_shrinks=20,shrink_interval=10
+
+ex: Changing shrink_interval:
+ localhost:/rcfs/taskclass # echo "res=mem,shrink_interval=15" > config
+ localhost:/rcfs/taskclass # cat config
+ res=mem,fail_over=120,shrink_at=85,shrink_to=75,num_shrinks=20,shrink_interval=15
Class creation
--------------
# mkdir c1
-Its initial share is don't care. The parent's share values will be unchanged.
+Its initial share is DONT_CARE. The parent's share values will be unchanged.
Setting a new class share
-------------------------
The stats file shows statistics of the page usage of a class:
# cat stats
----------- Memory Resource stats start -----------
+ System: tot_pages=239778,active=60473,inactive=135285,free=44555
Number of pages used(including pages lent to children): 196654
Number of pages guaranteed: 239778
Maximum limit of pages: 239778
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 10
-EXTRAVERSION =
+EXTRAVERSION = -1.14_FC2.1.planetlab.2005.03.30
NAME=Woozy Numbat
# *DOCUMENTATION*
#include <linux/syscalls.h>
#include <linux/rmap.h>
#include <linux/ckrm_events.h>
+#include <linux/ckrm_mem_inline.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
activate_mm(active_mm, mm);
task_unlock(tsk);
arch_pick_mmap_layout(mm);
+ ckrm_task_mm_change(tsk, old_mm, mm);
if (old_mm) {
if (active_mm != old_mm) BUG();
mmput(old_mm);
PROC_TID_ATTR_PREV,
PROC_TID_ATTR_EXEC,
PROC_TID_ATTR_FSCREATE,
+#endif
+#ifdef CONFIG_DELAY_ACCT
+ PROC_TID_DELAY_ACCT,
+ PROC_TGID_DELAY_ACCT,
#endif
PROC_TID_FD_DIR = 0x8000, /* 0x8000-0xffff */
};
#ifdef CONFIG_SECURITY
E(PROC_TGID_ATTR, "attr", S_IFDIR|S_IRUGO|S_IXUGO),
#endif
+#ifdef CONFIG_DELAY_ACCT
+ E(PROC_TGID_DELAY_ACCT,"delay", S_IFREG|S_IRUGO),
+#endif
#ifdef CONFIG_KALLSYMS
E(PROC_TGID_WCHAN, "wchan", S_IFREG|S_IRUGO),
#endif
#ifdef CONFIG_SECURITY
E(PROC_TID_ATTR, "attr", S_IFDIR|S_IRUGO|S_IXUGO),
#endif
+#ifdef CONFIG_DELAY_ACCT
+	E(PROC_TID_DELAY_ACCT,"delay",  S_IFREG|S_IRUGO),
+#endif
#ifdef CONFIG_KALLSYMS
E(PROC_TID_WCHAN, "wchan", S_IFREG|S_IRUGO),
#endif
int proc_tgid_stat(struct task_struct*,char*);
int proc_pid_status(struct task_struct*,char*);
int proc_pid_statm(struct task_struct*,char*);
+int proc_pid_delay(struct task_struct*,char*);
static int proc_fd_link(struct inode *inode, struct dentry **dentry, struct vfsmount **mnt)
{
ei->op.proc_read = proc_pid_wchan;
break;
#endif
+#ifdef CONFIG_DELAY_ACCT
+ case PROC_TID_DELAY_ACCT:
+ case PROC_TGID_DELAY_ACCT:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_pid_delay;
+ break;
+#endif
#ifdef CONFIG_SCHEDSTATS
case PROC_TID_SCHEDSTAT:
case PROC_TGID_SCHEDSTAT:
* Copyright (C) Jiantao Kong, IBM Corp. 2003
* (C) Shailabh Nagar, IBM Corp. 2003
* (C) Chandra Seetharaman, IBM Corp. 2004
- *
- *
- * Memory control functions of the CKRM kernel API
+ *
+ *
+ * Memory control functions of the CKRM kernel API
*
* Latest version, more details at http://ckrm.sf.net
- *
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
*
*/
-/* Changes
- *
- * 28 Aug 2003
- * Created.
- */
-
#ifndef _LINUX_CKRM_MEM_H
#define _LINUX_CKRM_MEM_H
#include <linux/list.h>
#include <linux/ckrm_rc.h>
+#include <linux/kref.h>
+
+struct ckrm_zone {
+ struct list_head active_list;
+ struct list_head inactive_list;
+
+ unsigned long nr_active;
+ unsigned long nr_inactive;
+ unsigned long active_over;
+ unsigned long inactive_over;
+
+ unsigned long shrink_active;
+ unsigned long shrink_inactive;
+ long shrink_weight;
+ unsigned long shrink_flag;
+ struct list_head victim_list; /* list of ckrm_zones chosen for
+ * shrinking. These are over their
+ * 'guarantee'
+ */
+ struct zone *zone;
+ struct ckrm_mem_res *memcls;
+};
+
+struct ckrm_mem_res {
+ unsigned long flags;
+ struct ckrm_core_class *core; /* the core i am part of... */
+ struct ckrm_core_class *parent; /* parent of the core i am part of */
+ struct ckrm_shares shares;
+ struct list_head mcls_list; /* list of all 1-level classes */
+ struct kref nr_users; /* ref count */
+ atomic_t pg_total; /* # of pages used by this class */
+ int pg_guar; /* absolute # of guarantee */
+ int pg_limit; /* absolute # of limit */
+ int pg_borrowed; /* # of pages borrowed from parent */
+ int pg_lent; /* # of pages lent to children */
+ int pg_unused; /* # of pages left to this class
+ * (after giving the guarantees to
+ * children. need to borrow from
+ * parent if more than this is needed.
+ */
+ int hier; /* hiearchy level, root = 0 */
+ int impl_guar; /* for classes with don't care guar */
+ int nr_dontcare; /* # of dont care children */
-typedef struct ckrm_mem_res {
- unsigned long reclaim_flags;
- unsigned long flags;
- struct ckrm_core_class *core; // the core i am part of...
- struct ckrm_core_class *parent; // parent of the core i am part of....
- struct ckrm_shares shares;
- struct list_head mcls_list; // list of all 1-level classes
- struct list_head shrink_list; // list of classes need to be shrunk
- atomic_t nr_users; // # of references to this class/data structure
- atomic_t pg_total; // # of pages used by this class
- int pg_guar; // # of pages this class is guaranteed
- int pg_limit; // max # of pages this class can get
- int pg_borrowed; // # of pages this class borrowed from its parent
- int pg_lent; // # of pages this class lent to its children
- int pg_unused; // # of pages left to this class (after giving the
- // guarantees to children. need to borrow from parent if
- // more than this is needed.
- int nr_active[MAX_NR_ZONES];
- int nr_inactive[MAX_NR_ZONES];
- int tmp_cnt;
+ struct ckrm_zone ckrm_zone[MAX_NR_ZONES];
+
+ struct list_head shrink_list; /* list of classes that are near
+ * limit and need to be shrunk
+ */
int shrink_count;
unsigned long last_shrink;
- int over_limit_failures;
- int hier; // hiearchy, root = 0
-} ckrm_mem_res_t;
+};
+
+#define CLS_SHRINK_BIT (1)
+
+#define CLS_AT_LIMIT (1)
extern atomic_t ckrm_mem_real_count;
-extern unsigned int ckrm_tot_lru_pages;
+extern struct ckrm_res_ctlr mem_rcbs;
+extern struct ckrm_mem_res *ckrm_mem_root_class;
+extern struct list_head ckrm_memclass_list;
extern struct list_head ckrm_shrink_list;
extern spinlock_t ckrm_mem_lock;
-extern struct ckrm_res_ctlr mem_rcbs;
-
-#define page_class(page) ((ckrm_mem_res_t*)((page)->memclass))
-
-// used to fill reclaim_flags, used only when memory is low in the system
-#define CLS_CLEAR (0) // class under its guarantee
-#define CLS_OVER_GUAR (1 << 0) // class is over its guarantee
-#define CLS_PARENT_OVER (1 << 1) // parent is over 110% mark over limit
-#define CLS_OVER_25 (1 << 2) // class over 25% mark bet guar(0) & limit(100)
-#define CLS_OVER_50 (1 << 3) // class over 50% mark bet guar(0) & limit(100)
-#define CLS_OVER_75 (1 << 4) // class over 75% mark bet guar(0) & limit(100)
-#define CLS_OVER_100 (1 << 5) // class over its limit
-#define CLS_OVER_110 (1 << 6) // class over 110% mark over limit
-#define CLS_FLAGS_ALL ( CLS_OVER_GUAR | CLS_PARENT_OVER | CLS_OVER_25 | \
- CLS_OVER_50 | CLS_OVER_75 | CLS_OVER_100 | CLS_OVER_110 )
-#define CLS_SHRINK_BIT (31) // used to both lock and set the bit
-#define CLS_SHRINK (1 << CLS_SHRINK_BIT) // shrink the given class
-
-// used in flags. set when a class is more than 90% of its maxlimit
-#define MEM_AT_LIMIT 1
-
-extern void ckrm_set_aggressive(ckrm_mem_res_t *);
-extern unsigned int ckrm_setup_reclamation(void);
-extern void ckrm_teardown_reclamation(void);
-extern void ckrm_get_reclaim_bits(unsigned int *, unsigned int *);
-extern void ckrm_init_mm_to_task(struct mm_struct *, struct task_struct *);
-extern void ckrm_mem_evaluate_mm(struct mm_struct *);
-extern void ckrm_at_limit(ckrm_mem_res_t *);
-extern int ckrm_memclass_valid(ckrm_mem_res_t *);
-#define ckrm_get_reclaim_flags(cls) ((cls)->reclaim_flags)
+extern int ckrm_nr_mem_classes;
+extern unsigned int ckrm_tot_lru_pages;
+extern int ckrm_mem_shrink_count;
+extern int ckrm_mem_shrink_to;
+extern int ckrm_mem_shrink_interval;
+extern void ckrm_mem_migrate_mm(struct mm_struct *, struct ckrm_mem_res *);
+extern void ckrm_mem_migrate_all_pages(struct ckrm_mem_res *,
+ struct ckrm_mem_res *);
+extern void memclass_release(struct kref *);
+extern void shrink_get_victims(struct zone *, unsigned long,
+ unsigned long, struct list_head *);
+extern void ckrm_shrink_atlimit(struct ckrm_mem_res *);
#else
-#define ckrm_init_mm_to_current(a) do {} while (0)
-#define ckrm_mem_evaluate_mm(a) do {} while (0)
-#define ckrm_get_reclaim_flags(a) (0)
-#define ckrm_setup_reclamation() (0)
-#define ckrm_teardown_reclamation() do {} while (0)
-#define ckrm_get_reclaim_bits(a, b) do { *(a) = 0; *(b)= 0; } while (0)
-#define ckrm_init_mm_to_task(a,b) do {} while (0)
-
-#endif // CONFIG_CKRM_RES_MEM
+#define ckrm_mem_migrate_mm(a, b) do {} while (0)
+#define ckrm_mem_migrate_all_pages(a, b) do {} while (0)
-#endif //_LINUX_CKRM_MEM_H
+#endif /* CONFIG_CKRM_RES_MEM */
+#endif /* _LINUX_CKRM_MEM_H */
* Copyright (C) Jiantao Kong, IBM Corp. 2003
* (C) Shailabh Nagar, IBM Corp. 2003
* (C) Chandra Seetharaman, IBM Corp. 2004
- *
- *
- * Memory control functions of the CKRM kernel API
+ *
+ *
+ * Memory control functions of the CKRM kernel API
*
* Latest version, more details at http://ckrm.sf.net
- *
+ *
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
*
*/
-/* Changes
- *
- * 28 Aug 2003
- * Created.
- */
-
-
#ifndef _LINUX_CKRM_MEM_INLINE_H_
#define _LINUX_CKRM_MEM_INLINE_H_
#ifdef CONFIG_CKRM_RES_MEM
-#define GET_MEM_CLASS(tsk) \
- ckrm_get_res_class(tsk->taskclass, mem_rcbs.resid, ckrm_mem_res_t)
+#define ckrm_shrink_list_empty() list_empty(&ckrm_shrink_list)
+
+static inline struct ckrm_mem_res *
+ckrm_get_mem_class(struct task_struct *tsk)
+{
+ return ckrm_get_res_class(tsk->taskclass, mem_rcbs.resid,
+ struct ckrm_mem_res);
+}
+
+static inline void
+ckrm_set_shrink(struct ckrm_zone *cz)
+{
+ set_bit(CLS_SHRINK_BIT, &cz->shrink_flag);
+}
+
+static inline int
+ckrm_test_set_shrink(struct ckrm_zone *cz)
+{
+ return test_and_set_bit(CLS_SHRINK_BIT, &cz->shrink_flag);
+}
+
+static inline void
+ckrm_clear_shrink(struct ckrm_zone *cz)
+{
+ clear_bit(CLS_SHRINK_BIT, &cz->shrink_flag);
+}
-#define ckrm_set_shrink(cls) \
- set_bit(CLS_SHRINK_BIT, (unsigned long *)&(cls)->reclaim_flags)
-#define ckrm_test_set_shrink(cls) \
- test_and_set_bit(CLS_SHRINK_BIT, (unsigned long *)&(cls)->reclaim_flags)
-#define ckrm_clear_shrink(cls) \
- clear_bit(CLS_SHRINK_BIT, (unsigned long *)&(cls)->reclaim_flags)
+static inline void
+set_page_ckrmzone(struct page *page, struct ckrm_zone *cz)
+{
+ page->ckrm_zone = cz;
+}
-#define ckrm_shrink_list_empty() list_empty(&ckrm_shrink_list)
+static inline struct ckrm_zone *
+page_ckrmzone(struct page *page)
+{
+ return page->ckrm_zone;
+}
/*
- * Currently, the class of an address is assigned to the class with max
- * available guarantee. Simply replace this function for other policies.
+ * Currently, a shared page that is shared by multiple classes is charged
+ * to a class with max available guarantee. Simply replace this function
+ * for other policies.
*/
static inline int
-ckrm_mem_share_compare(ckrm_mem_res_t *a, ckrm_mem_res_t *b)
+ckrm_mem_share_compare(struct ckrm_mem_res *a, struct ckrm_mem_res *b)
{
- if (a == NULL)
- return -(b != NULL) ;
+ if (a == NULL)
+ return -(b != NULL);
if (b == NULL)
+ return 1;
+ if (a->pg_guar == b->pg_guar)
return 0;
if (a->pg_guar == CKRM_SHARE_DONTCARE)
return 1;
}
static inline void
-mem_class_get(ckrm_mem_res_t *cls)
+incr_use_count(struct ckrm_mem_res *cls, int borrow)
{
- if (cls)
- atomic_inc(&((cls)->nr_users));
-}
+ extern int ckrm_mem_shrink_at;
+	struct ckrm_mem_res *parcls;	/* assigned after the NULL check on cls */
-static inline void
-mem_class_put(ckrm_mem_res_t *cls)
-{
-
- if (cls && atomic_dec_and_test(&(cls->nr_users)) ) {
- printk("freeing memclass %p of <core:%s>\n", cls, cls->core->name);
- BUG_ON(ckrm_memclass_valid(cls));
- //kfree(cls);
- }
-}
+ if (!cls)
+ return;
-static inline void
-incr_use_count(ckrm_mem_res_t *cls, int borrow)
-{
atomic_inc(&cls->pg_total);
-
- if (borrow)
+ if (borrow)
cls->pg_lent++;
- if ((cls->pg_guar == CKRM_SHARE_DONTCARE) ||
- (atomic_read(&cls->pg_total) > cls->pg_unused)) {
- ckrm_mem_res_t *parcls = ckrm_get_res_class(cls->parent,
- mem_rcbs.resid, ckrm_mem_res_t);
- if (parcls) {
- incr_use_count(parcls, 1);
- cls->pg_borrowed++;
- }
- } else {
+
+ parcls = ckrm_get_res_class(cls->parent,
+ mem_rcbs.resid, struct ckrm_mem_res);
+ if (parcls && ((cls->pg_guar == CKRM_SHARE_DONTCARE) ||
+ (atomic_read(&cls->pg_total) > cls->pg_unused))) {
+ incr_use_count(parcls, 1);
+ cls->pg_borrowed++;
+ } else
atomic_inc(&ckrm_mem_real_count);
- }
- if ((cls->pg_limit != CKRM_SHARE_DONTCARE) &&
- (atomic_read(&cls->pg_total) >= cls->pg_limit) &&
- ((cls->flags & MEM_AT_LIMIT) != MEM_AT_LIMIT)) {
- ckrm_at_limit(cls);
+
+ if ((cls->pg_limit != CKRM_SHARE_DONTCARE) &&
+ (atomic_read(&cls->pg_total) >=
+ ((ckrm_mem_shrink_at * cls->pg_limit) / 100)) &&
+ ((cls->flags & CLS_AT_LIMIT) != CLS_AT_LIMIT)) {
+ ckrm_shrink_atlimit(cls);
}
return;
}
static inline void
-decr_use_count(ckrm_mem_res_t *cls, int borrowed)
+decr_use_count(struct ckrm_mem_res *cls, int borrowed)
{
+ if (!cls)
+ return;
atomic_dec(&cls->pg_total);
if (borrowed)
cls->pg_lent--;
if (cls->pg_borrowed > 0) {
- ckrm_mem_res_t *parcls = ckrm_get_res_class(cls->parent,
- mem_rcbs.resid, ckrm_mem_res_t);
+ struct ckrm_mem_res *parcls = ckrm_get_res_class(cls->parent,
+ mem_rcbs.resid, struct ckrm_mem_res);
if (parcls) {
decr_use_count(parcls, 1);
cls->pg_borrowed--;
}
static inline void
-ckrm_set_page_class(struct page *page, ckrm_mem_res_t *cls)
+ckrm_set_page_class(struct page *page, struct ckrm_mem_res *cls)
{
- if (mem_rcbs.resid != -1 && cls != NULL) {
- if (unlikely(page->memclass)) {
- mem_class_put(page->memclass);
+ struct ckrm_zone *new_czone, *old_czone;
+
+ if (!cls) {
+ if (!ckrm_mem_root_class) {
+ set_page_ckrmzone(page, NULL);
+ return;
}
- page->memclass = cls;
- mem_class_get(cls);
- } else {
- page->memclass = NULL;
+ cls = ckrm_mem_root_class;
}
+ new_czone = &cls->ckrm_zone[page_zonenum(page)];
+ old_czone = page_ckrmzone(page);
+
+ if (old_czone)
+ kref_put(&old_czone->memcls->nr_users, memclass_release);
+
+ set_page_ckrmzone(page, new_czone);
+ kref_get(&cls->nr_users);
+ incr_use_count(cls, 0);
+ SetPageCkrmAccount(page);
}
static inline void
-ckrm_set_pages_class(struct page *pages, int numpages, ckrm_mem_res_t *cls)
+ckrm_change_page_class(struct page *page, struct ckrm_mem_res *newcls)
{
- int i;
- for (i = 0; i < numpages; pages++, i++) {
- ckrm_set_page_class(pages, cls);
+ struct ckrm_zone *old_czone = page_ckrmzone(page), *new_czone;
+ struct ckrm_mem_res *oldcls;
+
+ if (!newcls) {
+ if (!ckrm_mem_root_class)
+ return;
+ newcls = ckrm_mem_root_class;
+ }
+
+ oldcls = old_czone->memcls;
+ if (oldcls == newcls)
+ return;
+
+ if (oldcls) {
+ kref_put(&oldcls->nr_users, memclass_release);
+ decr_use_count(oldcls, 0);
+ }
+
+ new_czone = &newcls->ckrm_zone[page_zonenum(page)];
+ set_page_ckrmzone(page, new_czone);
+ kref_get(&newcls->nr_users);
+ incr_use_count(newcls, 0);
+
+ list_del(&page->lru);
+ if (PageActive(page)) {
+ old_czone->nr_active--;
+ new_czone->nr_active++;
+ list_add(&page->lru, &new_czone->active_list);
+ } else {
+ old_czone->nr_inactive--;
+ new_czone->nr_inactive++;
+ list_add(&page->lru, &new_czone->inactive_list);
}
}
static inline void
ckrm_clear_page_class(struct page *page)
{
- if (page->memclass != NULL) {
- mem_class_put(page->memclass);
- page->memclass = NULL;
+ struct ckrm_zone *czone = page_ckrmzone(page);
+ if (czone != NULL) {
+ if (PageCkrmAccount(page)) {
+ decr_use_count(czone->memcls, 0);
+ ClearPageCkrmAccount(page);
+ }
+ kref_put(&czone->memcls->nr_users, memclass_release);
+ set_page_ckrmzone(page, NULL);
}
}
static inline void
-ckrm_clear_pages_class(struct page *pages, int numpages)
+ckrm_mem_inc_active(struct page *page)
{
- int i;
- for (i = 0; i < numpages; pages++, i++) {
- ckrm_clear_page_class(pages);
- }
+ struct ckrm_mem_res *cls = ckrm_get_mem_class(current)
+ ?: ckrm_mem_root_class;
+ struct ckrm_zone *czone;
+
+ if (cls == NULL)
+ return;
+
+ ckrm_set_page_class(page, cls);
+ czone = page_ckrmzone(page);
+ czone->nr_active++;
+ list_add(&page->lru, &czone->active_list);
}
static inline void
-ckrm_change_page_class(struct page *page, ckrm_mem_res_t *newcls)
+ckrm_mem_dec_active(struct page *page)
{
- ckrm_mem_res_t *oldcls = page_class(page);
-
- if (!newcls || oldcls == newcls)
+ struct ckrm_zone *czone = page_ckrmzone(page);
+ if (czone == NULL)
return;
+ list_del(&page->lru);
+ czone->nr_active--;
ckrm_clear_page_class(page);
- ckrm_set_page_class(page, newcls);
- if (test_bit(PG_ckrm_account, &page->flags)) {
- decr_use_count(oldcls, 0);
- incr_use_count(newcls, 0);
- if (PageActive(page)) {
- oldcls->nr_active[page_zonenum(page)]--;
- newcls->nr_active[page_zonenum(page)]++;
- } else {
- oldcls->nr_inactive[page_zonenum(page)]--;
- newcls->nr_inactive[page_zonenum(page)]++;
- }
- }
}
+
static inline void
-ckrm_change_pages_class(struct page *pages, int numpages,
- ckrm_mem_res_t *cls)
+ckrm_mem_inc_inactive(struct page *page)
{
- int i;
- for (i = 0; i < numpages; pages++, i++) {
- ckrm_change_page_class(pages, cls);
- }
+ struct ckrm_mem_res *cls = ckrm_get_mem_class(current)
+ ?: ckrm_mem_root_class;
+ struct ckrm_zone *czone;
+
+ if (cls == NULL)
+ return;
+
+ ckrm_set_page_class(page, cls);
+ czone = page_ckrmzone(page);
+ czone->nr_inactive++;
+ list_add(&page->lru, &czone->inactive_list);
}
static inline void
-ckrm_mem_inc_active(struct page *page)
+ckrm_mem_dec_inactive(struct page *page)
{
- ckrm_mem_res_t *cls = page_class(page), *curcls;
- if (!cls) {
+ struct ckrm_zone *czone = page_ckrmzone(page);
+ if (czone == NULL)
return;
- }
- BUG_ON(test_bit(PG_ckrm_account, &page->flags));
- if (unlikely(cls != (curcls = GET_MEM_CLASS(current)))) {
- cls = curcls;
- ckrm_change_page_class(page, cls);
- }
- cls->nr_active[page_zonenum(page)]++;
- incr_use_count(cls, 0);
- set_bit(PG_ckrm_account, &page->flags);
+
+ czone->nr_inactive--;
+ list_del(&page->lru);
+ ckrm_clear_page_class(page);
}
static inline void
-ckrm_mem_dec_active(struct page *page)
+ckrm_zone_add_active(struct ckrm_zone *czone, int cnt)
{
- ckrm_mem_res_t *cls = page_class(page);
- if (!cls) {
- return;
- }
- BUG_ON(!test_bit(PG_ckrm_account, &page->flags));
- cls->nr_active[page_zonenum(page)]--;
- decr_use_count(cls, 0);
- clear_bit(PG_ckrm_account, &page->flags);
+ czone->nr_active += cnt;
}
static inline void
-ckrm_mem_inc_inactive(struct page *page)
+ckrm_zone_add_inactive(struct ckrm_zone *czone, int cnt)
{
- ckrm_mem_res_t *cls = page_class(page), *curcls;
- if (!cls) {
- return;
- }
- BUG_ON(test_bit(PG_ckrm_account, &page->flags));
- if (unlikely(cls != (curcls = GET_MEM_CLASS(current)))) {
- cls = curcls;
- ckrm_change_page_class(page, cls);
- }
- cls->nr_inactive[page_zonenum(page)]++;
- incr_use_count(cls, 0);
- set_bit(PG_ckrm_account, &page->flags);
+ czone->nr_inactive += cnt;
}
static inline void
-ckrm_mem_dec_inactive(struct page *page)
+ckrm_zone_sub_active(struct ckrm_zone *czone, int cnt)
{
- ckrm_mem_res_t *cls = page_class(page);
- if (!cls) {
- return;
- }
- BUG_ON(!test_bit(PG_ckrm_account, &page->flags));
- cls->nr_inactive[page_zonenum(page)]--;
- decr_use_count(cls, 0);
- clear_bit(PG_ckrm_account, &page->flags);
+ czone->nr_active -= cnt;
}
-static inline int
-ckrm_kick_page(struct page *page, unsigned int bits)
+static inline void
+ckrm_zone_sub_inactive(struct ckrm_zone *czone, int cnt)
{
- if (page_class(page) == NULL) {
- return bits;
- } else {
- return (page_class(page)->reclaim_flags & bits);
- }
+ czone->nr_inactive -= cnt;
}
-static inline int
-ckrm_class_limit_ok(ckrm_mem_res_t *cls)
+static inline int
+ckrm_class_limit_ok(struct ckrm_mem_res *cls)
{
+ int ret;
+
if ((mem_rcbs.resid == -1) || !cls) {
return 1;
}
if (cls->pg_limit == CKRM_SHARE_DONTCARE) {
- ckrm_mem_res_t *parcls = ckrm_get_res_class(cls->parent,
- mem_rcbs.resid, ckrm_mem_res_t);
- return (!parcls ?: ckrm_class_limit_ok(parcls));
- } else {
- return (atomic_read(&cls->pg_total) <= (11 * cls->pg_limit) / 10);
+ struct ckrm_mem_res *parcls = ckrm_get_res_class(cls->parent,
+ mem_rcbs.resid, struct ckrm_mem_res);
+ ret = (parcls ? ckrm_class_limit_ok(parcls) : 0);
+ } else
+ ret = (atomic_read(&cls->pg_total) <= cls->pg_limit);
+
+ /* If we are failing, just nudge the back end */
+ if (ret == 0)
+ ckrm_shrink_atlimit(cls);
+
+ return ret;
+}
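The rewritten ckrm_class_limit_ok() above walks up the class hierarchy when a class's limit is CKRM_SHARE_DONTCARE, and now fails closed (returns 0) when the parent lookup comes back NULL. A compilable sketch of that walk, with simplified stand-in types (`mem_cls`, `cls_limit_ok`, and the field layout are illustrative, not the patch's real structures):

```c
#include <stddef.h>

#define SHARE_DONTCARE -1

/* Simplified stand-in for struct ckrm_mem_res: only the fields
 * the limit check needs. */
struct mem_cls {
	struct mem_cls *parent;
	long pg_limit;	/* SHARE_DONTCARE means "inherit" */
	long pg_total;	/* pages currently charged to this class */
};

/* Walk up the hierarchy until a class with a concrete limit is
 * found; a DONTCARE class with no parent fails the check, mirroring
 * the patched ckrm_class_limit_ok() which returns 0 when parcls
 * is NULL. */
int cls_limit_ok(struct mem_cls *cls)
{
	if (!cls)
		return 1;	/* no class attached: nothing to enforce */
	if (cls->pg_limit == SHARE_DONTCARE)
		return cls->parent ? cls_limit_ok(cls->parent) : 0;
	return cls->pg_total <= cls->pg_limit;
}
```

Note the behavior change visible in the diff itself: the removed `(!parcls ?: ...)` returned true when the parent lookup failed, while the new code treats a missing parent as over limit.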
+
+static inline void
+ckrm_page_init(struct page *page)
+{
+ page->flags &= ~(1 << PG_ckrm_account);
+ set_page_ckrmzone(page, NULL);
+}
+
+
+/* task/mm initializations/cleanup */
+
+static inline void
+ckrm_task_mm_init(struct task_struct *tsk)
+{
+ INIT_LIST_HEAD(&tsk->mm_peers);
+}
+
+static inline void
+ckrm_task_mm_set(struct mm_struct *mm, struct task_struct *task)
+{
+ spin_lock(&mm->peertask_lock);
+ if (!list_empty(&task->mm_peers)) {
+ printk(KERN_ERR "MEM_RC: Task list NOT empty!! emptying...\n");
+ list_del_init(&task->mm_peers);
+ }
+ list_add_tail(&task->mm_peers, &mm->tasklist);
+ spin_unlock(&mm->peertask_lock);
+ if (mm->memclass != ckrm_get_mem_class(task))
+ ckrm_mem_migrate_mm(mm, NULL);
+}
+
+static inline void
+ckrm_task_mm_change(struct task_struct *tsk,
+ struct mm_struct *oldmm, struct mm_struct *newmm)
+{
+ if (oldmm) {
+ spin_lock(&oldmm->peertask_lock);
+ list_del(&tsk->mm_peers);
+ ckrm_mem_migrate_mm(oldmm, NULL);
+ spin_unlock(&oldmm->peertask_lock);
}
+ spin_lock(&newmm->peertask_lock);
+ list_add_tail(&tsk->mm_peers, &newmm->tasklist);
+ ckrm_mem_migrate_mm(newmm, NULL);
+ spin_unlock(&newmm->peertask_lock);
}
-#else // !CONFIG_CKRM_RES_MEM
+static inline void
+ckrm_task_mm_clear(struct task_struct *tsk, struct mm_struct *mm)
+{
+ spin_lock(&mm->peertask_lock);
+ list_del_init(&tsk->mm_peers);
+ ckrm_mem_migrate_mm(mm, NULL);
+ spin_unlock(&mm->peertask_lock);
+}
+
+static inline void
+ckrm_mm_init(struct mm_struct *mm)
+{
+ INIT_LIST_HEAD(&mm->tasklist);
+	spin_lock_init(&mm->peertask_lock);
+}
+
+static inline void
+ckrm_mm_setclass(struct mm_struct *mm, struct ckrm_mem_res *cls)
+{
+ mm->memclass = cls;
+ kref_get(&cls->nr_users);
+}
+
+static inline void
+ckrm_mm_clearclass(struct mm_struct *mm)
+{
+ if (mm->memclass) {
+ kref_put(&mm->memclass->nr_users, memclass_release);
+ mm->memclass = NULL;
+ }
+}
+
+static inline void ckrm_init_lists(struct zone *zone) {}
+
+static inline void ckrm_add_tail_inactive(struct page *page)
+{
+ struct ckrm_zone *ckrm_zone = page_ckrmzone(page);
+ list_add_tail(&page->lru, &ckrm_zone->inactive_list);
+}
+
+#else /* !CONFIG_CKRM_RES_MEM */
-#define ckrm_set_page_class(a,b) do{}while(0)
-#define ckrm_set_pages_class(a,b,c) do{}while(0)
-#define ckrm_clear_page_class(a) do{}while(0)
-#define ckrm_clear_pages_class(a,b) do{}while(0)
-#define ckrm_change_page_class(a,b) do{}while(0)
-#define ckrm_change_pages_class(a,b,c) do{}while(0)
-#define ckrm_mem_inc_active(a) do{}while(0)
-#define ckrm_mem_dec_active(a) do{}while(0)
-#define ckrm_mem_inc_inactive(a) do{}while(0)
-#define ckrm_mem_dec_inactive(a) do{}while(0)
#define ckrm_shrink_list_empty() (1)
-#define ckrm_kick_page(a,b) (0)
-#define ckrm_class_limit_ok(a) (1)
-#endif // CONFIG_CKRM_RES_MEM
+static inline void *
+ckrm_get_memclass(struct task_struct *tsk)
+{
+ return NULL;
+}
+
+static inline void ckrm_clear_page_class(struct page *p) {}
+
+static inline void ckrm_mem_inc_active(struct page *p) {}
+static inline void ckrm_mem_dec_active(struct page *p) {}
+static inline void ckrm_mem_inc_inactive(struct page *p) {}
+static inline void ckrm_mem_dec_inactive(struct page *p) {}
+
+#define ckrm_zone_add_active(a, b) do {} while (0)
+#define ckrm_zone_add_inactive(a, b) do {} while (0)
+#define ckrm_zone_sub_active(a, b) do {} while (0)
+#define ckrm_zone_sub_inactive(a, b) do {} while (0)
-#endif // _LINUX_CKRM_MEM_INLINE_H_
+#define ckrm_class_limit_ok(a) (1)
+
+static inline void ckrm_page_init(struct page *p) {}
+static inline void ckrm_task_mm_init(struct task_struct *tsk) {}
+static inline void ckrm_task_mm_set(struct mm_struct *mm,
+				    struct task_struct *task) {}
+static inline void ckrm_task_mm_change(struct task_struct *tsk,
+ struct mm_struct *oldmm, struct mm_struct *newmm) {}
+static inline void ckrm_task_mm_clear(struct task_struct *tsk,
+ struct mm_struct *mm) {}
+
+static inline void ckrm_mm_init(struct mm_struct *mm) {}
+
+/* Using a #define instead of a static inline, as the prototype
+ * requires data structures that are available only with the
+ * controller enabled. */
+#define ckrm_mm_setclass(a, b) do {} while (0)
+
+static inline void ckrm_mm_clearclass(struct mm_struct *mm) {}
+
+static inline void ckrm_init_lists(struct zone *zone)
+{
+ INIT_LIST_HEAD(&zone->active_list);
+ INIT_LIST_HEAD(&zone->inactive_list);
+}
+
+static inline void ckrm_add_tail_inactive(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ list_add_tail(&page->lru, &zone->inactive_list);
+}
+#endif /* CONFIG_CKRM_RES_MEM */
+#endif /* _LINUX_CKRM_MEM_INLINE_H_ */
#include <linux/rbtree.h>
#include <linux/prio_tree.h>
#include <linux/fs.h>
+#include <linux/ckrm_mem.h>
struct mempolicy;
struct anon_vma;
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
+#ifdef CONFIG_CKRM_RES_MEM
+ struct ckrm_zone *ckrm_zone;
+#endif
};
/*
+#include <linux/ckrm_mem_inline.h>
static inline void
add_page_to_active_list(struct zone *zone, struct page *page)
{
+#ifndef CONFIG_CKRM_RES_MEM
list_add(&page->lru, &zone->active_list);
+#endif
zone->nr_active++;
+ ckrm_mem_inc_active(page);
}
static inline void
add_page_to_inactive_list(struct zone *zone, struct page *page)
{
+#ifndef CONFIG_CKRM_RES_MEM
list_add(&page->lru, &zone->inactive_list);
+#endif
zone->nr_inactive++;
+ ckrm_mem_inc_inactive(page);
}
static inline void
del_page_from_active_list(struct zone *zone, struct page *page)
{
+#ifndef CONFIG_CKRM_RES_MEM
list_del(&page->lru);
+#endif
zone->nr_active--;
+ ckrm_mem_dec_active(page);
}
static inline void
del_page_from_inactive_list(struct zone *zone, struct page *page)
{
+#ifndef CONFIG_CKRM_RES_MEM
list_del(&page->lru);
+#endif
zone->nr_inactive--;
+ ckrm_mem_dec_inactive(page);
}
static inline void
del_page_from_lru(struct zone *zone, struct page *page)
{
+#ifndef CONFIG_CKRM_RES_MEM
list_del(&page->lru);
+#endif
if (PageActive(page)) {
ClearPageActive(page);
zone->nr_active--;
+ ckrm_mem_dec_active(page);
} else {
zone->nr_inactive--;
+ ckrm_mem_dec_inactive(page);
}
}
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
+#ifndef CONFIG_CKRM_RES_MEM
struct list_head active_list;
struct list_head inactive_list;
+#endif
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
unsigned long nr_active;
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
+#define PG_ckrm_account 20 /* CKRM accounting */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
#define PageSwapCache(page) 0
#endif
+#ifdef CONFIG_CKRM_RES_MEM
+#define PageCkrmAccount(page) test_bit(PG_ckrm_account, &(page)->flags)
+#define SetPageCkrmAccount(page) set_bit(PG_ckrm_account, &(page)->flags)
+#define ClearPageCkrmAccount(page) clear_bit(PG_ckrm_account, &(page)->flags)
+#endif
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
+#include <linux/taskdelays.h>
struct exec_domain;
struct kioctx *ioctx_list;
struct kioctx default_kioctx;
+
+#ifdef CONFIG_CKRM_RES_MEM
+ struct ckrm_mem_res *memclass;
+ struct list_head tasklist; /* tasks sharing this address space */
+ spinlock_t peertask_lock; /* protect tasklist above */
+#endif
};
struct sighand_struct {
struct mempolicy *mempolicy;
short il_next; /* could be shared with used_math */
#endif
+#ifdef CONFIG_CKRM
+ spinlock_t ckrm_tsklock;
+ void *ce_data;
+#ifdef CONFIG_CKRM_TYPE_TASKCLASS
+ struct ckrm_task_class *taskclass;
+ struct list_head taskclass_link;
+#endif /* CONFIG_CKRM_TYPE_TASKCLASS */
+#ifdef CONFIG_CKRM_RES_MEM
+ struct list_head mm_peers; /* list of tasks using same mm_struct */
+#endif
+#endif /* CONFIG_CKRM */
+#ifdef CONFIG_DELAY_ACCT
+ struct task_delay_info delays;
+#endif
};
static inline pid_t process_group(struct task_struct *tsk)
extern void set_task_comm(struct task_struct *tsk, char *from);
extern void get_task_comm(char *to, struct task_struct *tsk);
+#define PF_MEMIO 0x00400000 /* I am potentially doing I/O for mem */
+#define PF_IOWAIT 0x00800000 /* I am waiting on disk I/O */
+
#ifdef CONFIG_SMP
extern void wait_task_inactive(task_t * p);
#else
#endif
+/* API for registering delay info */
+#ifdef CONFIG_DELAY_ACCT
+
+#define test_delay_flag(tsk,flg) ((tsk)->flags & (flg))
+#define set_delay_flag(tsk,flg) ((tsk)->flags |= (flg))
+#define clear_delay_flag(tsk,flg) ((tsk)->flags &= ~(flg))
+
+#define def_delay_var(var) unsigned long long var
+#define get_delay(tsk,field) ((tsk)->delays.field)
+
+#define start_delay(var) ((var) = sched_clock())
+#define start_delay_set(var,flg) (set_delay_flag(current,flg),(var) = sched_clock())
+
+#define inc_delay(tsk,field) (((tsk)->delays.field)++)
+
+/*
+ * Because of hardware timer drift on SMP systems, and because a task
+ * may continue on a different CPU from where start_ts was taken,
+ * end_ts can be a few usecs earlier than start_ts.  In that case we
+ * ignore the difference and add nothing to the total.
+ */
+#ifdef CONFIG_SMP
+#define test_ts_integrity(start_ts,end_ts) (likely((end_ts) > (start_ts)))
+#else
+#define test_ts_integrity(start_ts,end_ts) (1)
+#endif
+
+#define add_delay_ts(tsk,field,start_ts,end_ts) \
+ do { if (test_ts_integrity(start_ts,end_ts)) (tsk)->delays.field += ((end_ts)-(start_ts)); } while (0)
+
+#define add_delay_clear(tsk,field,start_ts,flg) \
+ do { \
+ unsigned long long now = sched_clock();\
+ add_delay_ts(tsk,field,start_ts,now); \
+ clear_delay_flag(tsk,flg); \
+ } while (0)
+
+static inline void add_io_delay(unsigned long long dstart)
+{
+	struct task_struct *tsk = current;
+ unsigned long long now = sched_clock();
+ unsigned long long val;
+
+ if (test_ts_integrity(dstart,now))
+ val = now - dstart;
+ else
+ val = 0;
+ if (test_delay_flag(tsk,PF_MEMIO)) {
+ tsk->delays.mem_iowait_total += val;
+ tsk->delays.num_memwaits++;
+ } else {
+ tsk->delays.iowait_total += val;
+ tsk->delays.num_iowaits++;
+ }
+ clear_delay_flag(tsk,PF_IOWAIT);
+}
+
+static inline void init_delays(struct task_struct *tsk)
+{
+	memset(&tsk->delays, 0, sizeof(tsk->delays));
+}
+
+#else
+
+#define test_delay_flag(tsk,flg) (0)
+#define set_delay_flag(tsk,flg) do { } while (0)
+#define clear_delay_flag(tsk,flg) do { } while (0)
+
+#define def_delay_var(var)
+#define get_delay(tsk,field) (0)
+
+#define start_delay(var) do { } while (0)
+#define start_delay_set(var,flg) do { } while (0)
+
+#define inc_delay(tsk,field) do { } while (0)
+#define add_delay_ts(tsk,field,start_ts,now) do { } while (0)
+#define add_delay_clear(tsk,field,start_ts,flg) do { } while (0)
+#define add_io_delay(dstart) do { } while (0)
+#define init_delays(tsk) do { } while (0)
+#endif
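The comment in the hunk above notes that sched_clock() deltas can go negative when a task resumes on a different CPU, so add_delay_ts() silently drops such samples. A minimal userspace model of that clamping policy (the function name is illustrative; the real code is the macro pair above):

```c
typedef unsigned long long u64;

/* Accumulate (end_ts - start_ts) into *total, dropping the sample
 * when the clock appears to have gone backwards -- the same policy
 * as test_ts_integrity() + add_delay_ts() in the hunk above. */
void add_delay_sample(u64 *total, u64 start_ts, u64 end_ts)
{
	if (end_ts > start_ts)		/* test_ts_integrity() on SMP */
		*total += end_ts - start_ts;
	/* else: cross-CPU drift produced a negative delta; ignore it */
}
```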
+
#endif /* __KERNEL__ */
#endif
#define TCP_INFO 11 /* Information about this connection. */
#define TCP_QUICKACK 12 /* Block/reenable quick acks */
+#ifdef CONFIG_ACCEPT_QUEUES
+#define TCP_ACCEPTQ_SHARE 13 /* Set accept queue share */
+#endif
+
#define TCPI_OPT_TIMESTAMPS 1
#define TCPI_OPT_SACK 2
#define TCPI_OPT_WSCALE 4
__u32 tcpi_total_retrans;
};
+#ifdef CONFIG_ACCEPT_QUEUES
+
+#define NUM_ACCEPT_QUEUES 8 /* Must be power of 2 */
+
+struct tcp_acceptq_info {
+ unsigned char acceptq_shares;
+ unsigned long acceptq_wait_time;
+ unsigned int acceptq_qcount;
+ unsigned int acceptq_count;
+};
+#endif
+
#ifdef __KERNEL__
#include <linux/config.h>
/* FIFO of established children */
struct open_request *accept_queue;
+#ifndef CONFIG_ACCEPT_QUEUES
struct open_request *accept_queue_tail;
-
+#endif
unsigned int keepalive_time; /* time before keep alive takes place */
unsigned int keepalive_intvl; /* time interval between keep alive probes */
int linger2;
__u32 last_cwnd; /* the last snd_cwnd */
__u32 last_stamp; /* time when updated last_cwnd */
} bictcp;
+
+#ifdef CONFIG_ACCEPT_QUEUES
+ /* move to listen opt... */
+ char class_index;
+ struct {
+ struct open_request *aq_head;
+ struct open_request *aq_tail;
+ unsigned int aq_cnt;
+ unsigned int aq_ratio;
+ unsigned int aq_count;
+ unsigned int aq_qcount;
+ unsigned int aq_backlog;
+ unsigned int aq_wait_time;
+ } acceptq[NUM_ACCEPT_QUEUES];
+#endif
};
/* WARNING: don't change the layout of the members in tcp_sock! */
struct timeval sk_stamp;
struct socket *sk_socket;
void *sk_user_data;
+	void *sk_ns;		/* for use by CKRM */
struct module *sk_owner;
struct page *sk_sndmsg_page;
__u32 sk_sndmsg_off;
return test_bit(flag, &sk->sk_flags);
}
+#ifndef CONFIG_ACCEPT_QUEUES
static inline void sk_acceptq_removed(struct sock *sk)
{
sk->sk_ack_backlog--;
{
return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
}
+#endif
/*
* Compute minimal free write space needed to queue new packets.
struct tcp_v6_open_req v6_req;
#endif
} af;
+#ifdef CONFIG_ACCEPT_QUEUES
+ unsigned long acceptq_time_stamp;
+ int acceptq_class;
+#endif
};
/* SLAB cache for open requests. */
return tcp_win_from_space(sk->sk_rcvbuf);
}
+struct tcp_listen_opt
+{
+ u8 max_qlen_log; /* log_2 of maximal queued SYNs */
+ int qlen;
+#ifdef CONFIG_ACCEPT_QUEUES
+ int qlen_young[NUM_ACCEPT_QUEUES];
+#else
+ int qlen_young;
+#endif
+ int clock_hand;
+ u32 hash_rnd;
+ struct open_request *syn_table[TCP_SYNQ_HSIZE];
+};
+
+#ifdef CONFIG_ACCEPT_QUEUES
+static inline void sk_acceptq_removed(struct sock *sk, int class)
+{
+ tcp_sk(sk)->acceptq[class].aq_backlog--;
+}
+
+static inline void sk_acceptq_added(struct sock *sk, int class)
+{
+ tcp_sk(sk)->acceptq[class].aq_backlog++;
+}
+
+static inline int sk_acceptq_is_full(struct sock *sk, int class)
+{
+ return tcp_sk(sk)->acceptq[class].aq_backlog >
+ sk->sk_max_ack_backlog;
+}
+
+static inline void tcp_set_acceptq(struct tcp_opt *tp, struct open_request *req)
+{
+ int class = req->acceptq_class;
+ int prev_class;
+
+ if (!tp->acceptq[class].aq_ratio) {
+ req->acceptq_class = 0;
+ class = 0;
+ }
+
+ tp->acceptq[class].aq_qcount++;
+ req->acceptq_time_stamp = jiffies;
+
+ if (tp->acceptq[class].aq_tail) {
+ req->dl_next = tp->acceptq[class].aq_tail->dl_next;
+ tp->acceptq[class].aq_tail->dl_next = req;
+ tp->acceptq[class].aq_tail = req;
+ } else { /* if first request in the class */
+ tp->acceptq[class].aq_head = req;
+ tp->acceptq[class].aq_tail = req;
+
+ prev_class = class - 1;
+ while (prev_class >= 0) {
+ if (tp->acceptq[prev_class].aq_tail)
+ break;
+ prev_class--;
+ }
+ if (prev_class < 0) {
+ req->dl_next = tp->accept_queue;
+ tp->accept_queue = req;
+		} else {
+ req->dl_next = tp->acceptq[prev_class].aq_tail->dl_next;
+ tp->acceptq[prev_class].aq_tail->dl_next = req;
+ }
+ }
+}
+static inline void tcp_acceptq_queue(struct sock *sk, struct open_request *req,
+ struct sock *child)
+{
+	tcp_set_acceptq(tcp_sk(sk), req);
+	req->sk = child;
+	sk_acceptq_added(sk, req->acceptq_class);
+}
+
+#else
static inline void tcp_acceptq_queue(struct sock *sk, struct open_request *req,
struct sock *child)
{
req->dl_next = NULL;
}
-struct tcp_listen_opt
+#endif
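tcp_set_acceptq() above keeps one global FIFO threaded through per-class head/tail pointers: the first request of a class is spliced in after the tail of the nearest lower non-empty class, or at the global head if none exists. A standalone model of that splicing (`acceptq_add` and the struct names are illustrative stand-ins for the tcp_opt fields):

```c
#include <stddef.h>

#define NQUEUES 4	/* stand-in for NUM_ACCEPT_QUEUES */

struct req {
	struct req *dl_next;
	int cls;
};

struct accept_state {
	struct req *global_head;	/* tp->accept_queue */
	struct req *head[NQUEUES];	/* aq_head per class */
	struct req *tail[NQUEUES];	/* aq_tail per class */
};

/* Append r to its class FIFO while keeping the single global list
 * ordered class 0, 1, ... -- the same splicing done by
 * tcp_set_acceptq() in the hunk above. */
void acceptq_add(struct accept_state *st, struct req *r)
{
	int c = r->cls;

	if (st->tail[c]) {			/* class already non-empty */
		r->dl_next = st->tail[c]->dl_next;
		st->tail[c]->dl_next = r;
		st->tail[c] = r;
		return;
	}
	st->head[c] = st->tail[c] = r;		/* first of its class */
	while (--c >= 0)
		if (st->tail[c])
			break;
	if (c < 0) {				/* no lower class queued */
		r->dl_next = st->global_head;
		st->global_head = r;
	} else {
		r->dl_next = st->tail[c]->dl_next;
		st->tail[c]->dl_next = r;
	}
}
```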
+
+
+#ifdef CONFIG_ACCEPT_QUEUES
+static inline void
+tcp_synq_removed(struct sock *sk, struct open_request *req)
{
- u8 max_qlen_log; /* log_2 of maximal queued SYNs */
- int qlen;
- int qlen_young;
- int clock_hand;
- u32 hash_rnd;
- struct open_request *syn_table[TCP_SYNQ_HSIZE];
-};
+ struct tcp_listen_opt *lopt = tcp_sk(sk)->listen_opt;
+
+ if (--lopt->qlen == 0)
+ tcp_delete_keepalive_timer(sk);
+ if (req->retrans == 0)
+ lopt->qlen_young[req->acceptq_class]--;
+}
+
+static inline void tcp_synq_added(struct sock *sk, struct open_request *req)
+{
+ struct tcp_listen_opt *lopt = tcp_sk(sk)->listen_opt;
+
+ if (lopt->qlen++ == 0)
+ tcp_reset_keepalive_timer(sk, TCP_TIMEOUT_INIT);
+ lopt->qlen_young[req->acceptq_class]++;
+}
+
+static inline int tcp_synq_len(struct sock *sk)
+{
+ return tcp_sk(sk)->listen_opt->qlen;
+}
+
+static inline int tcp_synq_young(struct sock *sk, int class)
+{
+ return tcp_sk(sk)->listen_opt->qlen_young[class];
+}
+
+#else
static inline void
tcp_synq_removed(struct sock *sk, struct open_request *req)
{
return tcp_sk(sk)->listen_opt->qlen_young;
}
+#endif
static inline int tcp_synq_is_full(struct sock *sk)
{
Say N if unsure
+config CKRM_RES_MEM
+ bool "Class based physical memory controller"
+ default y
+ depends on CKRM
+ help
+ Provide the basic support for collecting physical memory usage
+ information among classes. Say Y if you want to know the memory
+ usage of each class.
+
config CKRM_TYPE_SOCKETCLASS
bool "Class Manager for socket groups"
depends on CKRM && RCFS_FS
obj-$(CONFIG_CKRM_TYPE_SOCKETCLASS) += ckrm_sockc.o
obj-$(CONFIG_CKRM_RES_NUMTASKS) += ckrm_numtasks.o
obj-$(CONFIG_CKRM_RES_LISTENAQ) += ckrm_listenaq.o
+obj-$(CONFIG_CKRM_RES_MEM) += ckrm_memcore.o ckrm_memctlr.o
#include <linux/mempolicy.h>
#include <linux/ckrm_events.h>
#include <linux/syscalls.h>
+#include <linux/ckrm_mem_inline.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
task_lock(tsk);
tsk->mm = NULL;
up_read(&mm->mmap_sem);
+ ckrm_task_mm_clear(tsk, mm);
enter_lazy_tlb(mm, current);
task_unlock(tsk);
mmput(mm);
#include <linux/rmap.h>
#include <linux/ckrm_events.h>
#include <linux/ckrm_tsk.h>
+#include <linux/ckrm_tc.h>
+#include <linux/ckrm_mem_inline.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
ti->task = tsk;
ckrm_cb_newtask(tsk);
+ ckrm_task_mm_init(tsk);
/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
return tsk;
mm->ioctx_list = NULL;
mm->default_kioctx = (struct kioctx)INIT_KIOCTX(mm->default_kioctx, *mm);
mm->free_area_cache = TASK_UNMAPPED_BASE;
+ ckrm_mm_init(mm);
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
if (mm) {
memset(mm, 0, sizeof(*mm));
mm = mm_init(mm);
+ ckrm_mm_setclass(mm, ckrm_get_mem_class(current));
}
return mm;
}
good_mm:
tsk->mm = mm;
tsk->active_mm = mm;
+ ckrm_mm_setclass(mm, oldmm->memclass);
+ ckrm_task_mm_set(mm, tsk);
return 0;
free_pt:
#define task_rq(p) cpu_rq(task_cpu(p))
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
+#define task_is_running(p) (this_rq() == task_rq(p))
+
/*
* Default context-switch locking:
*/
clear_tsk_need_resched(prev);
rcu_qsctr_inc(task_cpu(prev));
+ add_delay_ts(prev, runcpu_total, prev->timestamp, now);
prev->sleep_avg -= run_time;
if ((long)prev->sleep_avg <= 0) {
prev->sleep_avg = 0;
sched_info_switch(prev, next);
if (likely(prev != next)) {
+ add_delay_ts(next, waitcpu_total, next->timestamp, now);
+ inc_delay(next, runs);
next->timestamp = now;
rq->nr_switches++;
rq->curr = next;
{
struct runqueue *rq = this_rq();
+ def_delay_var(dstart);
+ start_delay_set(dstart, PF_IOWAIT);
atomic_inc(&rq->nr_iowait);
schedule();
atomic_dec(&rq->nr_iowait);
+ add_io_delay(dstart);
}
EXPORT_SYMBOL(io_schedule);
{
struct runqueue *rq = this_rq();
long ret;
+ def_delay_var(dstart);
+	start_delay_set(dstart, PF_IOWAIT);
atomic_inc(&rq->nr_iowait);
ret = schedule_timeout(timeout);
atomic_dec(&rq->nr_iowait);
+ add_io_delay(dstart);
return ret;
}
}
#endif /* CONFIG_MAGIC_SYSRQ */
+
+#ifdef CONFIG_DELAY_ACCT
+int task_running_sys(struct task_struct *p)
+{
+ return task_is_running(p);
+}
+EXPORT_SYMBOL_GPL(task_running_sys);
+#endif
+
#include <linux/sysctl.h>
#include <linux/cpu.h>
#include <linux/nodemask.h>
+#include <linux/ckrm_mem_inline.h>
#include <asm/tlbflush.h>
/* have to delete it as __free_pages_bulk list manipulates */
list_del(&page->lru);
__free_pages_bulk(page, base, zone, area, order);
+ ckrm_clear_page_class(page);
ret++;
}
spin_unlock_irqrestore(&zone->lock, flags);
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked | 1 << PG_mappedtodisk);
page->private = 0;
+ ckrm_page_init(page);
set_page_refs(page, order);
}
*/
can_try_harder = (unlikely(rt_task(p)) && !in_interrupt()) || !wait;
+ if (!in_interrupt() && !ckrm_class_limit_ok(ckrm_get_mem_class(p)))
+ return NULL;
+
zones = zonelist->zones; /* the list of zones suitable for gfp_mask */
if (unlikely(zones[0] == NULL)) {
}
printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%lu\n",
zone_names[j], realsize, batch);
- INIT_LIST_HEAD(&zone->active_list);
- INIT_LIST_HEAD(&zone->inactive_list);
+ ckrm_init_lists(zone);
zone->nr_scan_active = 0;
zone->nr_scan_inactive = 0;
zone->nr_active = 0;
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/init.h>
+#include <linux/ckrm_mem_inline.h>
/* How many pages do we try to swap or page in/out together? */
int page_cluster;
spin_lock_irqsave(&zone->lru_lock, flags);
if (PageLRU(page) && !PageActive(page)) {
list_del(&page->lru);
- list_add_tail(&page->lru, &zone->inactive_list);
+ ckrm_add_tail_inactive(page);
inc_page_state(pgrotated);
}
if (!test_clear_page_writeback(page))
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/rwsem.h>
+#include <linux/ckrm_mem.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
* For pagecache intensive workloads, the first loop here is the hottest spot
* in the kernel (apart from the copy_*_user functions).
*/
+#ifdef CONFIG_CKRM_RES_MEM
+static void shrink_cache(struct ckrm_zone *ckrm_zone, struct scan_control *sc)
+#else
static void shrink_cache(struct zone *zone, struct scan_control *sc)
+#endif
{
LIST_HEAD(page_list);
struct pagevec pvec;
int max_scan = sc->nr_to_scan;
+#ifdef CONFIG_CKRM_RES_MEM
+ struct zone *zone = ckrm_zone->zone;
+ struct list_head *inactive_list = &ckrm_zone->inactive_list;
+ struct list_head *active_list = &ckrm_zone->active_list;
+#else
+ struct list_head *inactive_list = &zone->inactive_list;
+ struct list_head *active_list = &zone->active_list;
+#endif
pagevec_init(&pvec, 1);
int nr_freed;
while (nr_scan++ < SWAP_CLUSTER_MAX &&
- !list_empty(&zone->inactive_list)) {
- page = lru_to_page(&zone->inactive_list);
+ !list_empty(inactive_list)) {
+ page = lru_to_page(inactive_list);
prefetchw_prev_lru_page(page,
- &zone->inactive_list, flags);
+ inactive_list, flags);
if (!TestClearPageLRU(page))
BUG();
*/
__put_page(page);
SetPageLRU(page);
- list_add(&page->lru, &zone->inactive_list);
+ list_add(&page->lru, inactive_list);
continue;
}
list_add(&page->lru, &page_list);
nr_taken++;
}
zone->nr_inactive -= nr_taken;
+ ckrm_zone_sub_inactive(ckrm_zone, nr_taken);
spin_unlock_irq(&zone->lru_lock);
if (nr_taken == 0)
if (TestSetPageLRU(page))
BUG();
list_del(&page->lru);
- if (PageActive(page))
- add_page_to_active_list(zone, page);
- else
- add_page_to_inactive_list(zone, page);
+ if (PageActive(page)) {
+ ckrm_zone_add_active(ckrm_zone, 1);
+ zone->nr_active++;
+ list_add(&page->lru, active_list);
+ } else {
+ ckrm_zone_add_inactive(ckrm_zone, 1);
+ zone->nr_inactive++;
+ list_add(&page->lru, inactive_list);
+ }
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
* But we had to alter page->flags anyway.
*/
static void
+#ifdef CONFIG_CKRM_RES_MEM
+refill_inactive_zone(struct ckrm_zone *ckrm_zone, struct scan_control *sc)
+#else
refill_inactive_zone(struct zone *zone, struct scan_control *sc)
+#endif
{
int pgmoved;
int pgdeactivate = 0;
long mapped_ratio;
long distress;
long swap_tendency;
+#ifdef CONFIG_CKRM_RES_MEM
+ struct zone *zone = ckrm_zone->zone;
+ struct list_head *active_list = &ckrm_zone->active_list;
+ struct list_head *inactive_list = &ckrm_zone->inactive_list;
+#else
+ struct list_head *active_list = &zone->active_list;
+ struct list_head *inactive_list = &zone->inactive_list;
+#endif
lru_add_drain();
pgmoved = 0;
spin_lock_irq(&zone->lru_lock);
- while (pgscanned < nr_pages && !list_empty(&zone->active_list)) {
- page = lru_to_page(&zone->active_list);
- prefetchw_prev_lru_page(page, &zone->active_list, flags);
+ while (pgscanned < nr_pages && !list_empty(active_list)) {
+ page = lru_to_page(active_list);
+ prefetchw_prev_lru_page(page, active_list, flags);
if (!TestClearPageLRU(page))
BUG();
list_del(&page->lru);
*/
__put_page(page);
SetPageLRU(page);
- list_add(&page->lru, &zone->active_list);
+ list_add(&page->lru, active_list);
} else {
list_add(&page->lru, &l_hold);
pgmoved++;
}
zone->pages_scanned += pgscanned;
zone->nr_active -= pgmoved;
+ ckrm_zone_sub_active(ckrm_zone, pgmoved);
spin_unlock_irq(&zone->lru_lock);
/*
BUG();
if (!TestClearPageActive(page))
BUG();
- list_move(&page->lru, &zone->inactive_list);
+ list_move(&page->lru, inactive_list);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
zone->nr_inactive += pgmoved;
+ ckrm_zone_add_inactive(ckrm_zone, pgmoved);
spin_unlock_irq(&zone->lru_lock);
pgdeactivate += pgmoved;
pgmoved = 0;
}
}
zone->nr_inactive += pgmoved;
+ ckrm_zone_add_inactive(ckrm_zone, pgmoved);
pgdeactivate += pgmoved;
if (buffer_heads_over_limit) {
spin_unlock_irq(&zone->lru_lock);
if (TestSetPageLRU(page))
BUG();
BUG_ON(!PageActive(page));
- list_move(&page->lru, &zone->active_list);
+ list_move(&page->lru, active_list);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
zone->nr_active += pgmoved;
+ ckrm_zone_add_active(ckrm_zone, pgmoved);
pgmoved = 0;
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
}
}
zone->nr_active += pgmoved;
+ ckrm_zone_add_active(ckrm_zone, pgmoved);
spin_unlock_irq(&zone->lru_lock);
pagevec_release(&pvec);
mod_page_state(pgdeactivate, pgdeactivate);
}
+#ifdef CONFIG_CKRM_RES_MEM
+static void
+shrink_ckrmzone(struct ckrm_zone *czone, struct scan_control *sc)
+{
+ while (czone->shrink_active || czone->shrink_inactive) {
+ if (czone->shrink_active) {
+ sc->nr_to_scan = min(czone->shrink_active,
+ (unsigned long)SWAP_CLUSTER_MAX);
+ czone->shrink_active -= sc->nr_to_scan;
+ refill_inactive_zone(czone, sc);
+ }
+ if (czone->shrink_inactive) {
+ sc->nr_to_scan = min(czone->shrink_inactive,
+ (unsigned long)SWAP_CLUSTER_MAX);
+ czone->shrink_inactive -= sc->nr_to_scan;
+ shrink_cache(czone, sc);
+ if (sc->nr_to_reclaim <= 0) {
+ czone->shrink_active = 0;
+ czone->shrink_inactive = 0;
+ break;
+ }
+ }
+ }
+}
+
+/* FIXME: This function needs to be given more thought. */
+static void
+ckrm_shrink_class(struct ckrm_mem_res *cls)
+{
+ struct scan_control sc;
+ struct zone *zone;
+ int zindex = 0, cnt, act_credit = 0, inact_credit = 0;
+
+ sc.nr_mapped = read_page_state(nr_mapped);
+ sc.nr_scanned = 0;
+ sc.nr_reclaimed = 0;
+	sc.priority = 0;		/* always very high priority */
+
+ for_each_zone(zone) {
+ int zone_total, zone_limit, active_limit,
+ inactive_limit, clszone_limit;
+ struct ckrm_zone *czone;
+ u64 temp;
+
+ czone = &cls->ckrm_zone[zindex];
+ if (ckrm_test_set_shrink(czone))
+ continue;
+
+ zone->temp_priority = zone->prev_priority;
+ zone->prev_priority = sc.priority;
+
+ zone_total = zone->nr_active + zone->nr_inactive
+ + zone->free_pages;
+
+ temp = (u64) cls->pg_limit * zone_total;
+ do_div(temp, ckrm_tot_lru_pages);
+ zone_limit = (int) temp;
+ clszone_limit = (ckrm_mem_shrink_to * zone_limit) / 100;
+		active_limit = (2 * clszone_limit) / 3;	/* 2/3 on active list */
+		inactive_limit = clszone_limit / 3;	/* 1/3 on inactive list */
+
+ czone->shrink_active = 0;
+ cnt = czone->nr_active + act_credit - active_limit;
+ if (cnt > 0) {
+ czone->shrink_active = (unsigned long) cnt;
+ act_credit = 0;
+ } else {
+ act_credit += cnt;
+ }
+
+ czone->shrink_inactive = 0;
+ cnt = czone->shrink_active + inact_credit +
+ (czone->nr_inactive - inactive_limit);
+ if (cnt > 0) {
+ czone->shrink_inactive = (unsigned long) cnt;
+ inact_credit = 0;
+ } else {
+ inact_credit += cnt;
+ }
+
+ if (czone->shrink_active || czone->shrink_inactive) {
+ sc.nr_to_reclaim = czone->shrink_inactive;
+ shrink_ckrmzone(czone, &sc);
+ }
+ zone->prev_priority = zone->temp_priority;
+ zindex++;
+ ckrm_clear_shrink(czone);
+ }
+}
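ckrm_shrink_class() above sizes each class's per-zone shrink targets by scaling the class page limit with the zone's share of total LRU pages, then splitting the shrunk-to budget 2/3 active and 1/3 inactive. A single-zone sketch of that arithmetic (the real code also carries act_credit/inact_credit across zones, which this illustration omits; the names here are stand-ins):

```c
struct shrink_target {
	long shrink_active;
	long shrink_inactive;
};

/* Compute how many pages to move off the active list and how many
 * to reclaim from the inactive list, following the proportional
 * zone_limit / clszone_limit math in ckrm_shrink_class() above. */
struct shrink_target
compute_shrink(long pg_limit, long zone_total, long tot_lru_pages,
	       int shrink_to_pct, long nr_active, long nr_inactive)
{
	long zone_limit = (long)((unsigned long long)pg_limit *
				 zone_total / tot_lru_pages);
	long clszone_limit = shrink_to_pct * zone_limit / 100;
	long active_limit = 2 * clszone_limit / 3;
	long inactive_limit = clszone_limit / 3;
	struct shrink_target t = { 0, 0 };

	if (nr_active > active_limit)
		t.shrink_active = nr_active - active_limit;
	/* the inactive target also absorbs pages deactivated above */
	if (t.shrink_active + nr_inactive > inactive_limit)
		t.shrink_inactive = t.shrink_active + nr_inactive
				    - inactive_limit;
	return t;
}
```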
+
+static void
+ckrm_shrink_classes(void)
+{
+ struct ckrm_mem_res *cls;
+
+ spin_lock(&ckrm_mem_lock);
+ while (!ckrm_shrink_list_empty()) {
+ cls = list_entry(ckrm_shrink_list.next, struct ckrm_mem_res,
+ shrink_list);
+ list_del(&cls->shrink_list);
+ cls->flags &= ~CLS_AT_LIMIT;
+ spin_unlock(&ckrm_mem_lock);
+ ckrm_shrink_class(cls);
+ spin_lock(&ckrm_mem_lock);
+ }
+ spin_unlock(&ckrm_mem_lock);
+}
+
+#else
+#define ckrm_shrink_classes() do { } while (0)
+#endif
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
{
unsigned long nr_active;
unsigned long nr_inactive;
+#ifdef CONFIG_CKRM_RES_MEM
+ struct ckrm_zone *czone;
+#endif
+
/*
* Add one to `nr_to_scan' just to make sure that the kernel will
sc->nr_to_reclaim = SWAP_CLUSTER_MAX;
+#ifdef CONFIG_CKRM_RES_MEM
+ if (nr_active || nr_inactive) {
+ struct list_head *pos, *next;
+ LIST_HEAD(victims);
+
+ shrink_get_victims(zone, nr_active, nr_inactive, &victims);
+ pos = victims.next;
+ while (pos != &victims) {
+ czone = list_entry(pos, struct ckrm_zone, victim_list);
+ next = pos->next;
+ list_del_init(pos);
+ sc->nr_to_reclaim = czone->shrink_inactive;
+ shrink_ckrmzone(czone, sc);
+ ckrm_clear_shrink(czone);
+ pos = next;
+ }
+ }
+#else
while (nr_active || nr_inactive) {
if (nr_active) {
sc->nr_to_scan = min(nr_active,
break;
}
}
+#endif
}
/*
schedule();
finish_wait(&pgdat->kswapd_wait, &wait);
- balance_pgdat(pgdat, 0);
+ if (!ckrm_shrink_list_empty())
+ ckrm_shrink_classes();
+ else
+ balance_pgdat(pgdat, 0);
}
return 0;
}
#include <linux/fs.h>
#include <linux/random.h>
+#ifdef CONFIG_CKRM
+#include <linux/ckrm_events.h>
+#endif
+
#include <net/icmp.h>
#include <net/tcp.h>
#include <net/xfrm.h>
int tcp_listen_start(struct sock *sk)
{
+#ifdef CONFIG_ACCEPT_QUEUES
+ int i = 0;
+#endif
struct inet_opt *inet = inet_sk(sk);
struct tcp_opt *tp = tcp_sk(sk);
struct tcp_listen_opt *lopt;
sk->sk_max_ack_backlog = 0;
sk->sk_ack_backlog = 0;
+#ifdef CONFIG_ACCEPT_QUEUES
+ tp->accept_queue = NULL;
+#else
tp->accept_queue = tp->accept_queue_tail = NULL;
+#endif
rwlock_init(&tp->syn_wait_lock);
tcp_delack_init(tp);
break;
get_random_bytes(&lopt->hash_rnd, 4);
+#ifdef CONFIG_ACCEPT_QUEUES
+ tp->class_index = 0;
+ for (i=0; i < NUM_ACCEPT_QUEUES; i++) {
+ tp->acceptq[i].aq_tail = NULL;
+ tp->acceptq[i].aq_head = NULL;
+ tp->acceptq[i].aq_wait_time = 0;
+ tp->acceptq[i].aq_qcount = 0;
+ tp->acceptq[i].aq_count = 0;
+		if (i == 0)
+			tp->acceptq[i].aq_ratio = 1;
+		else
+			tp->acceptq[i].aq_ratio = 0;
+ }
+#endif
+
write_lock_bh(&tp->syn_wait_lock);
tp->listen_opt = lopt;
write_unlock_bh(&tp->syn_wait_lock);
sk_dst_reset(sk);
sk->sk_prot->hash(sk);
+#ifdef CONFIG_CKRM
+ ckrm_cb_listen_start(sk);
+#endif
+
return 0;
}
write_lock_bh(&tp->syn_wait_lock);
tp->listen_opt = NULL;
write_unlock_bh(&tp->syn_wait_lock);
- tp->accept_queue = tp->accept_queue_tail = NULL;
+
+#ifdef CONFIG_CKRM
+ ckrm_cb_listen_stop(sk);
+#endif
+
+#ifdef CONFIG_ACCEPT_QUEUES
+ for (i = 0; i < NUM_ACCEPT_QUEUES; i++)
+ tp->acceptq[i].aq_head = tp->acceptq[i].aq_tail = NULL;
+#else
+ tp->accept_queue_tail = NULL;
+#endif
+ tp->accept_queue = NULL;
if (lopt->qlen) {
for (i = 0; i < TCP_SYNQ_HSIZE; i++) {
local_bh_enable();
sock_put(child);
+#ifdef CONFIG_ACCEPT_QUEUES
+ sk_acceptq_removed(sk, req->acceptq_class);
+#else
sk_acceptq_removed(sk);
+#endif
tcp_openreq_fastfree(req);
}
BUG_TRAP(!sk->sk_ack_backlog);
struct open_request *req;
struct sock *newsk;
int error;
+#ifdef CONFIG_ACCEPT_QUEUES
+ int prev_class = 0;
+ int first;
+#endif
lock_sock(sk);
goto out;
}
+#ifndef CONFIG_ACCEPT_QUEUES
req = tp->accept_queue;
if ((tp->accept_queue = req->dl_next) == NULL)
tp->accept_queue_tail = NULL;
-
newsk = req->sk;
sk_acceptq_removed(sk);
+#else
+ first = tp->class_index;
+	/* We should always have a request queued here; the
+	 * accept_queue was already checked for NULL above.
+	 */
+ while(!tp->acceptq[first].aq_head) {
+ tp->acceptq[first].aq_cnt = 0;
+		first = (first + 1) & (NUM_ACCEPT_QUEUES - 1);
+ }
+ req = tp->acceptq[first].aq_head;
+ tp->acceptq[first].aq_qcount--;
+ tp->acceptq[first].aq_count++;
+	tp->acceptq[first].aq_wait_time += jiffies - req->acceptq_time_stamp;
+
+	for (prev_class = first - 1; prev_class >= 0; prev_class--)
+		if (tp->acceptq[prev_class].aq_tail)
+			break;
+	if (prev_class >= 0)
+ tp->acceptq[prev_class].aq_tail->dl_next = req->dl_next;
+ else
+ tp->accept_queue = req->dl_next;
+
+ if (req == tp->acceptq[first].aq_tail)
+ tp->acceptq[first].aq_head = tp->acceptq[first].aq_tail = NULL;
+ else
+ tp->acceptq[first].aq_head = req->dl_next;
+
+ if((++(tp->acceptq[first].aq_cnt)) >= tp->acceptq[first].aq_ratio){
+ tp->acceptq[first].aq_cnt = 0;
+ tp->class_index = ++first & (NUM_ACCEPT_QUEUES-1);
+ }
+ newsk = req->sk;
+ sk_acceptq_removed(sk, req->acceptq_class);
+#endif
tcp_openreq_fastfree(req);
BUG_TRAP(newsk->sk_state != TCP_SYN_RECV);
release_sock(sk);
}
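The dequeue path above implements a weighted round-robin over the per-class accept queues: accepts are taken from the current class until `aq_cnt` reaches `aq_ratio`, then the turn passes to the next non-empty class. A standalone sketch (all names and the reduction of `aq_head` to a pending-request counter are hypothetical simplifications; the real patch walks an `open_request` list):

```c
#include <assert.h>

#define NUM_ACCEPT_QUEUES 8

struct aq {
	int aq_ratio;	/* weight: accepts taken before moving on */
	int aq_cnt;	/* accepts taken from this class this round */
	int aq_head;	/* pending requests (a counter in this sketch) */
};

/* Pick the class to accept from, advancing the round-robin cursor. */
static int next_class(struct aq q[], int *class_index)
{
	int first = *class_index;

	/* skip empty classes, wrapping modulo NUM_ACCEPT_QUEUES */
	while (!q[first].aq_head) {
		q[first].aq_cnt = 0;
		first = (first + 1) & (NUM_ACCEPT_QUEUES - 1);
	}
	q[first].aq_head--;

	/* after aq_ratio accepts, hand the turn to the next class */
	if (++q[first].aq_cnt >= q[first].aq_ratio) {
		q[first].aq_cnt = 0;
		*class_index = (first + 1) & (NUM_ACCEPT_QUEUES - 1);
	} else {
		*class_index = first;
	}
	return first;
}
```

With class 0 at ratio 2 and class 1 at ratio 1, accepts interleave as two from class 0, then one from class 1.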
break;
+#ifdef CONFIG_ACCEPT_QUEUES
+ case TCP_ACCEPTQ_SHARE:
+#ifdef CONFIG_CKRM
+ /* If CKRM is enabled, the shares are set through rcfs;
+ * getsockopt will still succeed.
+ */
+ err = -EOPNOTSUPP;
+ break;
+#else
+ {
+ char share_wt[NUM_ACCEPT_QUEUES];
+ int i,j;
+
+ if (sk->sk_state != TCP_LISTEN)
+ return -EOPNOTSUPP;
+
+ if (optlen > sizeof(share_wt)) {
+ err = -EINVAL;
+ break;
+ }
+
+ memset(share_wt, 0, sizeof(share_wt));
+ if (copy_from_user(share_wt, optval, optlen)) {
+ err = -EFAULT;
+ break;
+ }
+ j = 0;
+ for (i = 0; i < NUM_ACCEPT_QUEUES; i++) {
+ if (share_wt[i]) {
+ if (!j || share_wt[i] < j)
+ j = share_wt[i];
+ } else
+ tp->acceptq[i].aq_ratio = 0;
+ }
+ if (j == 0) {
+ /* Class 0 is always valid. If no weight was
+ * specified, default class 0 to 1.
+ */
+ share_wt[0] = 1;
+ j = 1;
+ }
+ for (i = 0; i < NUM_ACCEPT_QUEUES; i++) {
+ tp->acceptq[i].aq_ratio = share_wt[i] / j;
+ tp->acceptq[i].aq_cnt = 0;
+ }
+ }
+ break;
+#endif
+#endif
default:
err = -ENOPROTOOPT;
break;
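The TCP_ACCEPTQ_SHARE setsockopt above turns user weights into per-class ratios by dividing each weight by the smallest non-zero weight, so the lightest active class ends up with ratio 1. A sketch of that normalization (the function name and signature are illustrative, not from the patch):

```c
#include <assert.h>
#include <string.h>

#define NUM_ACCEPT_QUEUES 8

/* Normalize weights: each ratio is the class weight divided by the
 * smallest non-zero weight; classes with weight 0 get ratio 0. */
static void normalize_shares(unsigned char wt[], int ratio[])
{
	int i, min = 0;

	for (i = 0; i < NUM_ACCEPT_QUEUES; i++)
		if (wt[i] && (!min || wt[i] < min))
			min = wt[i];

	if (!min) {
		/* nothing specified: class 0 is always valid, default 1 */
		wt[0] = 1;
		min = 1;
	}
	for (i = 0; i < NUM_ACCEPT_QUEUES; i++)
		ratio[i] = wt[i] / min;
}
```

For weights {4, 2, 0, 6} this yields ratios {2, 1, 0, 3}; with no weights set at all, class 0 defaults to ratio 1.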
case TCP_QUICKACK:
val = !tp->ack.pingpong;
break;
+
+#ifdef CONFIG_ACCEPT_QUEUES
+ case TCP_ACCEPTQ_SHARE:
+ {
+ struct tcp_acceptq_info tinfo[NUM_ACCEPT_QUEUES];
+ int i;
+
+ if (sk->sk_state != TCP_LISTEN)
+ return -EOPNOTSUPP;
+
+ if (get_user(len, optlen))
+ return -EFAULT;
+
+ memset(tinfo, 0, sizeof(tinfo));
+
+ for (i = 0; i < NUM_ACCEPT_QUEUES; i++) {
+ tinfo[i].acceptq_wait_time =
+ jiffies_to_msecs(tp->acceptq[i].aq_wait_time);
+ tinfo[i].acceptq_qcount = tp->acceptq[i].aq_qcount;
+ tinfo[i].acceptq_count = tp->acceptq[i].aq_count;
+ tinfo[i].acceptq_shares = tp->acceptq[i].aq_ratio;
+ }
+
+ len = min_t(unsigned int, len, sizeof(tinfo));
+ if (put_user(len, optlen))
+ return -EFAULT;
+
+ if (copy_to_user(optval, (char *)tinfo, len))
+ return -EFAULT;
+
+ return 0;
+ }
+ break;
+#endif
default:
return -ENOPROTOOPT;
};
}
/* Optimize the common listener case. */
-static inline struct sock *tcp_v4_lookup_listener(u32 daddr,
+inline struct sock *tcp_v4_lookup_listener(u32 daddr,
unsigned short hnum, int dif)
{
struct sock *sk = NULL;
lopt->syn_table[h] = req;
write_unlock(&tp->syn_wait_lock);
+#ifdef CONFIG_ACCEPT_QUEUES
+ tcp_synq_added(sk, req);
+#else
tcp_synq_added(sk);
+#endif
}
__u32 daddr = skb->nh.iph->daddr;
__u32 isn = TCP_SKB_CB(skb)->when;
struct dst_entry *dst = NULL;
+#ifdef CONFIG_ACCEPT_QUEUES
+ int class = 0;
+#endif
#ifdef CONFIG_SYN_COOKIES
int want_cookie = 0;
#else
goto drop;
}
+#ifdef CONFIG_ACCEPT_QUEUES
+ class = (skb->nfmark >= NUM_ACCEPT_QUEUES) ? 0 : skb->nfmark;
+ /*
+ * Accept only if this class has shares set; otherwise fall
+ * back to the default class (class 0) if it has shares.
+ */
+ if (!(tcp_sk(sk)->acceptq[class].aq_ratio)) {
+ if (tcp_sk(sk)->acceptq[0].aq_ratio)
+ class = 0;
+ else
+ goto drop;
+ }
+#endif
+
/* Accept backlog is full. If we have already queued enough
* of warm entries in syn queue, drop request. It is better than
* clogging syn queue with openreqs with exponentially increasing
* timeout.
*/
+#ifdef CONFIG_ACCEPT_QUEUES
+ if (sk_acceptq_is_full(sk, class) && tcp_synq_young(sk, class) > 1)
+#else
if (sk_acceptq_is_full(sk) && tcp_synq_young(sk) > 1)
+#endif
goto drop;
req = tcp_openreq_alloc();
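The classification step above maps the skb's netfilter mark to an accept-queue class, falling back to class 0 for out-of-range marks or classes without shares, and dropping when class 0 has no shares either. A standalone sketch (names are illustrative; `-1` stands in for the patch's `goto drop`):

```c
#include <assert.h>

#define NUM_ACCEPT_QUEUES 8

/* Map an nfmark to an accept-queue class, with class-0 fallback. */
static int classify(unsigned long nfmark, const int aq_ratio[])
{
	int cls = (nfmark >= NUM_ACCEPT_QUEUES) ? 0 : (int)nfmark;

	if (!aq_ratio[cls]) {
		if (!aq_ratio[0])
			return -1;	/* no shares anywhere: drop */
		cls = 0;
	}
	return cls;
}
```

Marks are typically set with the netfilter MARK target, so the listener's classes can be driven by ordinary firewall rules.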
tp.tstamp_ok = tp.saw_tstamp;
tcp_openreq_init(req, &tp, skb);
-
+#ifdef CONFIG_ACCEPT_QUEUES
+ req->acceptq_class = class;
+ req->acceptq_time_stamp = jiffies;
+#endif
req->af.v4_req.loc_addr = daddr;
req->af.v4_req.rmt_addr = saddr;
req->af.v4_req.opt = tcp_v4_save_options(sk, skb);
struct tcp_opt *newtp;
struct sock *newsk;
+#ifdef CONFIG_ACCEPT_QUEUES
+ if (sk_acceptq_is_full(sk, req->acceptq_class))
+#else
if (sk_acceptq_is_full(sk))
+#endif
goto exit_overflow;
if (!dst && (dst = tcp_v4_route_req(sk, req)) == NULL)
EXPORT_SYMBOL(tcp_put_port);
EXPORT_SYMBOL(tcp_unhash);
EXPORT_SYMBOL(tcp_v4_conn_request);
+EXPORT_SYMBOL(tcp_v4_lookup_listener);
EXPORT_SYMBOL(tcp_v4_connect);
EXPORT_SYMBOL(tcp_v4_do_rcv);
EXPORT_SYMBOL(tcp_v4_rebuild_header);
newtp->num_sacks = 0;
newtp->urg_data = 0;
newtp->listen_opt = NULL;
+#ifdef CONFIG_ACCEPT_QUEUES
+ newtp->accept_queue = NULL;
+ memset(newtp->acceptq, 0, sizeof(newtp->acceptq));
+ newtp->class_index = 0;
+
+#else
newtp->accept_queue = newtp->accept_queue_tail = NULL;
+#endif
/* Deinitialize syn_wait_lock to trap illegal accesses. */
memset(&newtp->syn_wait_lock, 0, sizeof(newtp->syn_wait_lock));
* ones are about to clog our table.
*/
if (lopt->qlen>>(lopt->max_qlen_log-1)) {
+#ifdef CONFIG_ACCEPT_QUEUES
+ int young = 0;
+
+ for (i = 0; i < NUM_ACCEPT_QUEUES; i++)
+ young += lopt->qlen_young[i];
+
+ young <<= 1;
+#else
int young = (lopt->qlen_young<<1);
+#endif
while (thresh > 2) {
if (lopt->qlen < young)
unsigned long timeo;
if (req->retrans++ == 0)
+#ifdef CONFIG_ACCEPT_QUEUES
+ lopt->qlen_young[req->acceptq_class]--;
+#else
lopt->qlen_young--;
- timeo = min((TCP_TIMEOUT_INIT << req->retrans),
- TCP_RTO_MAX);
+#endif
+ timeo = min((TCP_TIMEOUT_INIT << req->retrans), TCP_RTO_MAX);
req->expires = now + timeo;
reqp = &req->dl_next;
continue;
write_unlock(&tp->syn_wait_lock);
lopt->qlen--;
if (req->retrans == 0)
+#ifdef CONFIG_ACCEPT_QUEUES
+ lopt->qlen_young[req->acceptq_class]--;
+#else
lopt->qlen_young--;
+#endif
tcp_openreq_free(req);
continue;
}
lopt->syn_table[h] = req;
write_unlock(&tp->syn_wait_lock);
+#ifdef CONFIG_ACCEPT_QUEUES
+ tcp_synq_added(sk, req);
+#else
tcp_synq_added(sk);
+#endif
}
struct tcp_opt tmptp, *tp = tcp_sk(sk);
struct open_request *req = NULL;
__u32 isn = TCP_SKB_CB(skb)->when;
+#ifdef CONFIG_ACCEPT_QUEUES
+ int class = 0;
+#endif
if (skb->protocol == htons(ETH_P_IP))
return tcp_v4_conn_request(sk, skb);
goto drop;
}
+#ifdef CONFIG_ACCEPT_QUEUES
+ class = (skb->nfmark >= NUM_ACCEPT_QUEUES) ? 0 : skb->nfmark;
+ /*
+ * Accept only if this class has shares set; otherwise fall
+ * back to the default class (class 0) if it has shares.
+ */
+ if (!(tcp_sk(sk)->acceptq[class].aq_ratio)) {
+ if (tcp_sk(sk)->acceptq[0].aq_ratio)
+ class = 0;
+ else
+ goto drop;
+ }
+
+ if (sk_acceptq_is_full(sk, class) && tcp_synq_young(sk, class) > 1)
+#else
if (sk_acceptq_is_full(sk) && tcp_synq_young(sk) > 1)
+#endif
goto drop;
+
req = tcp_openreq_alloc();
if (req == NULL)
goto drop;
tmptp.tstamp_ok = tmptp.saw_tstamp;
tcp_openreq_init(req, &tmptp, skb);
-
+#ifdef CONFIG_ACCEPT_QUEUES
+ req->acceptq_class = class;
+ req->acceptq_time_stamp = jiffies;
+#endif
req->class = &or_ipv6;
ipv6_addr_copy(&req->af.v6_req.rmt_addr, &skb->nh.ipv6h->saddr);
ipv6_addr_copy(&req->af.v6_req.loc_addr, &skb->nh.ipv6h->daddr);
opt = np->opt;
+#ifdef CONFIG_ACCEPT_QUEUES
+ if (sk_acceptq_is_full(sk, req->acceptq_class))
+#else
if (sk_acceptq_is_full(sk))
+#endif
goto out_overflow;
if (np->rxopt.bits.srcrt == 2 &&