Linux Kernel 2.4 Internals: IPC mechanisms

5. IPC mechanisms

This chapter describes the semaphore, shared memory, and message queue IPC mechanisms as implemented in the Linux 2.4 kernel. It is organized into four sections. The first three sections cover the interfaces and support functions for semaphores, message queues, and shared memory respectively. The last section describes a set of common functions and data structures that are shared by all three mechanisms.

5.1 Semaphores

The functions described in this section implement the user level semaphore mechanisms. Note that this implementation relies on the use of kernel spinlocks and kernel semaphores. To avoid confusion, the term "kernel semaphore" will be used in reference to kernel semaphores. All other uses of the word "semaphore" will be in reference to the user level semaphores.

Semaphore System Call Interfaces

sys_semget()

The entire call to sys_semget() is protected by the global sem_ids.sem kernel semaphore.

In the case where a new set of semaphores must be created, the newary() function is called to create and initialize a new semaphore set. The ID of the new set is returned to the caller.

In the case where a key value is provided for an existing semaphore set, ipc_findkey() is invoked to look up the corresponding semaphore descriptor array index. The parameters and permissions of the caller are verified before returning the semaphore set ID.

sys_semctl()

For the IPC_INFO, SEM_INFO, and SEM_STAT commands, semctl_nolock() is called to perform the necessary functions.

For the GETALL, GETVAL, GETPID, GETNCNT, GETZCNT, IPC_STAT, SETVAL, and SETALL commands, semctl_main() is called to perform the necessary functions.

For the IPC_RMID and IPC_SET commands, semctl_down() is called to perform the necessary functions. Throughout both of these operations, the global sem_ids.sem kernel semaphore is held.

sys_semop()

After validating the call parameters, the semaphore operations data is copied from user space to a temporary buffer. If a small temporary buffer is sufficient, then a stack buffer is used. Otherwise, a larger buffer is allocated. After copying in the semaphore operations data, the global semaphores spinlock is locked, and the user-specified semaphore set ID is validated. Access permissions for the semaphore set are also validated.

All of the user-specified semaphore operations are parsed. During this process, a count is maintained of all the operations that have the SEM_UNDO flag set. A decrease flag is set if any of the operations subtract from a semaphore value, and an alter flag is set if any of the semaphore values are modified (i.e. increased or decreased). The number of each semaphore to be modified is validated.
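The parsing pass described above can be sketched as a small user-space model. This is an illustration under stated assumptions, not the kernel code: struct model_sembuf, struct scan_result, scan_sops() and the MODEL_SEM_UNDO bit are hypothetical names introduced here, and the error return is reduced to -1.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_SEM_UNDO 0x1000   /* hypothetical flag bit standing in for SEM_UNDO */

/* Local stand-in for struct sembuf. */
struct model_sembuf {
    unsigned short sem_num;
    short          sem_op;
    short          sem_flg;
};

/* What the parsing pass collects. */
struct scan_result {
    int undos;      /* number of operations with the undo flag set */
    int decrease;   /* 1 if any operation subtracts from a semaphore */
    int alter;      /* 1 if any operation changes a semaphore value */
};

/* Returns 0 on success, -1 if an operation names a semaphore outside the set. */
int scan_sops(const struct model_sembuf *sops, size_t nsops,
              unsigned long sem_nsems, struct scan_result *out)
{
    assert(sops != NULL && out != NULL);
    out->undos = out->decrease = out->alter = 0;

    for (size_t i = 0; i < nsops; i++) {
        if (sops[i].sem_num >= sem_nsems)
            return -1;                  /* semaphore number out of range */
        if (sops[i].sem_flg & MODEL_SEM_UNDO)
            out->undos++;
        if (sops[i].sem_op < 0)
            out->decrease = 1;
        if (sops[i].sem_op > 0)
            out->alter = 1;
    }
    out->alter |= out->decrease;        /* a decrease is also an alteration */
    return 0;
}
```

A sequence made up only of wait-for-zero operations (sem_op equal to 0) leaves both flags clear, which is what later makes it a non-altering sequence.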

If SEM_UNDO was asserted for any of the semaphore operations, then the undo list for the current task is searched for an undo structure associated with this semaphore set. During this search, if the semaphore set ID of any of the undo structures is found to be -1, then freeundos() is called to free the undo structure and remove it from the list. If no undo structure is found for this semaphore set then alloc_undo() is called to allocate and initialize one.

The try_atomic_semop() function is called with the do_undo parameter equal to 0 in order to execute the sequence of operations. The return value indicates that either the operations passed, failed, or were not executed because they need to block. Each of these cases is further described below:

Non-blocking Semaphore Operations

The try_atomic_semop() function returns zero to indicate that all operations in the sequence succeeded. In this case, update_queue() is called to traverse the queue of pending semaphore operations for the semaphore set and awaken any sleeping tasks that no longer need to block. This completes the execution of the sys_semop() system call for this case.

Failing Semaphore Operations

If try_atomic_semop() returns a negative value, then a failure condition was encountered. In this case, none of the operations have been executed. This occurs when either a semaphore operation would cause an invalid semaphore value, or an operation marked IPC_NOWAIT is unable to complete. The error condition is then returned to the caller of sys_semop().

Before sys_semop() returns, a call is made to update_queue() to traverse the queue of pending semaphore operations for the semaphore set and awaken any sleeping tasks that no longer need to block.

Blocking Semaphore Operations

The try_atomic_semop() function returns 1 to indicate that the sequence of semaphore operations was not executed because one of the semaphores would block. For this case, a new sem_queue element is initialized containing these semaphore operations. If any of these operations would alter the state of the semaphore, then the new queue element is added at the tail of the queue. Otherwise, the new queue element is added at the head of the queue.
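The head-or-tail placement rule can be illustrated with a reduced model of the pending queue. The sketch below is an assumption-laden illustration rather than the kernel source: struct model_queue and struct model_array keep only the link fields of sem_queue and sem_array, and the function names are chosen for this example. It relies on the convention, noted in the sem_queue definition, that prev points at whichever pointer currently refers to the element.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for struct sem_queue and struct sem_array. */
struct model_queue {
    struct model_queue  *next;
    struct model_queue **prev;   /* invariant: *(q->prev) == q */
    int                  alter;  /* sequence would alter a semaphore */
};

struct model_array {
    struct model_queue  *sem_pending;       /* head of the queue */
    struct model_queue **sem_pending_last;  /* address of the last next pointer */
};

void queue_init(struct model_array *sma)
{
    sma->sem_pending = NULL;
    sma->sem_pending_last = &sma->sem_pending;
}

void append_to_queue(struct model_array *sma, struct model_queue *q)
{
    q->prev = sma->sem_pending_last;
    *q->prev = q;                       /* link from the old tail (or the head) */
    q->next = NULL;
    sma->sem_pending_last = &q->next;
}

void prepend_to_queue(struct model_array *sma, struct model_queue *q)
{
    q->next = sma->sem_pending;
    q->prev = &sma->sem_pending;
    sma->sem_pending = q;
    if (q->next)
        q->next->prev = &q->next;
    else
        sma->sem_pending_last = &q->next;   /* queue was empty */
}

/* Altering sequences go to the tail, non-altering ones to the head. */
void enqueue_blocked(struct model_array *sma, struct model_queue *q)
{
    assert(sma != NULL && q != NULL);
    if (q->alter)
        append_to_queue(sma, q);
    else
        prepend_to_queue(sma, q);
}
```

Because prev is a pointer to a pointer, an element can be unlinked without knowing whether it is first in the queue.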

The semsleeping element of the current task is set to indicate that the task is sleeping on this sem_queue element. The current task is marked as TASK_INTERRUPTIBLE, and the sleeper element of the sem_queue is set to identify this task as the sleeper. The global semaphore spinlock is then unlocked, and schedule() is called to put the current task to sleep.

When awakened, the task re-locks the global semaphore spinlock, determines why it was awakened, and how it should respond. The following cases are handled:

Semaphore Specific Support Structures

The following structures are used specifically for semaphore support:

struct sem_array


/* One sem_array data structure for each set of semaphores in the system. */
struct sem_array {
    struct kern_ipc_perm sem_perm; /* permissions .. see ipc.h */
    time_t sem_otime; /* last semop time */
    time_t sem_ctime; /* last change time */
    struct sem *sem_base; /* ptr to first semaphore in array */
    struct sem_queue *sem_pending; /* pending operations to be processed */
    struct sem_queue **sem_pending_last; /* last pending operation */
    struct sem_undo *undo; /* undo requests on this array */
    unsigned long sem_nsems; /* no. of semaphores in array */
};

struct sem


/* One semaphore structure for each semaphore in the system. */
struct sem {
        int     semval;         /* current value */
        int     sempid;         /* pid of last operation */
};

struct seminfo


struct  seminfo {
        int semmap;
        int semmni;
        int semmns;
        int semmnu;
        int semmsl;
        int semopm;
        int semume;
        int semusz;
        int semvmx;
        int semaem;
};

struct semid64_ds


struct semid64_ds {
        struct ipc64_perm sem_perm;             /* permissions .. see ipc.h */
        __kernel_time_t sem_otime;              /* last semop time */
        unsigned long   __unused1;
        __kernel_time_t sem_ctime;              /* last change time */
        unsigned long   __unused2;
        unsigned long   sem_nsems;              /* no. of semaphores in array */
        unsigned long   __unused3;
        unsigned long   __unused4;
};

struct sem_queue


/* One queue for each sleeping process in the system. */
struct sem_queue {
        struct sem_queue *      next;    /* next entry in the queue */
        struct sem_queue **     prev;    /* previous entry in the queue, *(q->prev) == q */
        struct task_struct*     sleeper; /* this process */
        struct sem_undo *       undo;    /* undo structure */
        int                     pid;     /* process id of requesting process */
        int                     status;  /* completion status of operation */
        struct sem_array *      sma;     /* semaphore array for operations */
        int                     id;      /* internal sem id */
        struct sembuf *         sops;    /* array of pending operations */
        int                     nsops;   /* number of operations */
        int                     alter;   /* operation will alter semaphore */
};

struct sembuf


/* semop system calls takes an array of these. */
struct sembuf {
        unsigned short  sem_num;        /* semaphore index in array */
        short           sem_op;         /* semaphore operation */
        short           sem_flg;        /* operation flags */
};

struct sem_undo


/* Each task has a list of undo requests. They are executed automatically
 * when the process exits.
 */
struct sem_undo {
        struct sem_undo *       proc_next;      /* next entry on this process */
        struct sem_undo *       id_next;        /* next entry on this semaphore set */
        int                     semid;          /* semaphore set identifier */
        short *                 semadj;         /* array of adjustments, one per semaphore */
};

Semaphore Support Functions

The following functions are used specifically in support of semaphores:

newary()

newary() relies on the ipc_alloc() function to allocate the memory required for the new semaphore set. It allocates enough memory for the semaphore set descriptor and for each of the semaphores in the set. The allocated memory is cleared, and the address of the first element of the semaphore set descriptor is passed to ipc_addid(). ipc_addid() reserves an array entry for the new semaphore set descriptor and initializes the (struct kern_ipc_perm) data for the set. The global used_sems variable is updated by the number of semaphores in the new set and the initialization of the (struct kern_ipc_perm) data for the new set is completed. The other initialization steps performed for this set are listed below:

All of the operations following the call to ipc_addid() are performed while holding the global semaphores spinlock. After unlocking the global semaphores spinlock, newary() calls ipc_buildid() (via sem_buildid()). This function uses the index of the semaphore set descriptor to create a unique ID, that is then returned to the caller of newary().
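As a rough illustration of how an ID can be derived from a descriptor index and mapped back again, the sketch below assumes the ID is formed by multiplying a per-type sequence number by a fixed multiplier and adding the array index. The multiplier value 32768 and all of the model_* names are assumptions made for this example, not taken from the text above.

```c
#include <assert.h>

#define MODEL_SEQ_MULTIPLIER 32768   /* assumed multiplier, must exceed the largest index */

/* Combine a descriptor array index with a sequence number into an ID. */
int model_buildid(int index, int seq)
{
    assert(index >= 0 && index < MODEL_SEQ_MULTIPLIER);
    return MODEL_SEQ_MULTIPLIER * seq + index;
}

/* Recover the descriptor array index from an ID. */
int model_id_to_index(int id)
{
    return id % MODEL_SEQ_MULTIPLIER;
}

/* Recover the sequence number, which lets a stale ID be detected. */
int model_id_to_seq(int id)
{
    return id / MODEL_SEQ_MULTIPLIER;
}
```

Under this scheme a reused array slot yields a different ID each time, so an ID held by a process after the set was removed no longer matches the slot's current sequence number.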

freeary()

freeary() is called by semctl_down() to perform the functions listed below. It is called with the global semaphores spinlock locked and it returns with the spinlock unlocked.

semctl_down()

semctl_down() provides the IPC_RMID and IPC_SET operations of the semctl() system call. The semaphore set ID and the access permissions are verified prior to either of these operations, and in either case, the global semaphore spinlock is held throughout the operation.

IPC_RMID

The IPC_RMID operation calls freeary() to remove the semaphore set.

IPC_SET

The IPC_SET operation updates the uid, gid, mode, and ctime elements of the semaphore set.

semctl_nolock()

semctl_nolock() is called by sys_semctl() to perform the IPC_INFO, SEM_INFO and SEM_STAT functions.

IPC_INFO and SEM_INFO

IPC_INFO and SEM_INFO cause a temporary seminfo buffer to be initialized and loaded with unchanging semaphore statistical data. Then, while holding the global sem_ids.sem kernel semaphore, the semusz and semaem elements of the seminfo structure are updated according to the given command (IPC_INFO or SEM_INFO). The return value of the system call is set to the maximum semaphore set ID.

SEM_STAT

SEM_STAT causes a temporary semid64_ds buffer to be initialized. The global semaphore spinlock is then held while copying the sem_otime, sem_ctime, and sem_nsems values into the buffer. This data is then copied to user space.

semctl_main()

semctl_main() is called by sys_semctl() to perform many of the supported functions, as described in the subsections below. Prior to performing any of the following operations, semctl_main() locks the global semaphore spinlock and validates the semaphore set ID and the permissions. The spinlock is released before returning.

GETALL

The GETALL operation loads the current semaphore values into a temporary kernel buffer and copies them out to user space. The small stack buffer is used if the semaphore set is small. Otherwise, the spinlock is temporarily dropped in order to allocate a larger buffer. The spinlock is held while copying the semaphore values into the temporary buffer.

SETALL

The SETALL operation copies semaphore values from user space into a temporary buffer, and then into the semaphore set. The spinlock is dropped while copying the values from user space into the temporary buffer, and while verifying reasonable values. If the semaphore set is small, then a stack buffer is used, otherwise a larger buffer is allocated. The spinlock is regained and held while the following operations are performed on the semaphore set:

IPC_STAT

In the IPC_STAT operation, the sem_otime, sem_ctime, and sem_nsems values are copied into a stack buffer. The data is then copied to user space after dropping the spinlock.

GETVAL

For GETVAL in the non-error case, the return value for the system call is set to the value of the specified semaphore.

GETPID

For GETPID in the non-error case, the return value for the system call is set to the pid associated with the last operation on the semaphore.

GETNCNT

For GETNCNT in the non-error case, the return value for the system call is set to the number of processes waiting for the value of the semaphore to increase, that is, processes blocked on an operation that would make the value less than zero. This number is calculated by the count_semncnt() function.

GETZCNT

For GETZCNT in the non-error case, the return value for the system call is set to the number of processes waiting on the semaphore being set to zero. This number is calculated by the count_semzcnt() function.

SETVAL

After validating the new semaphore value, the following functions are performed:

count_semncnt()

count_semncnt() counts the number of tasks waiting for the value of a semaphore to increase, that is, tasks whose pending operation would make the value less than zero.

count_semzcnt()

count_semzcnt() counts the number of tasks waiting on the value of a semaphore to be zero.
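Both counters can be pictured as a walk over the queue of pending operation sequences. The sketch below is a simplified model under assumptions: the model_* types and the MODEL_IPC_NOWAIT bit are hypothetical, the two functions are folded into one with a want_zero switch, and it assumes that operations marked with the no-wait flag are not counted as waiters.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_IPC_NOWAIT 04000   /* hypothetical flag bit standing in for IPC_NOWAIT */

struct model_sembuf {
    unsigned short sem_num;
    short          sem_op;
    short          sem_flg;
};

/* Reduced stand-in for struct sem_queue: one blocked sequence per element. */
struct model_queue {
    struct model_queue        *next;
    const struct model_sembuf *sops;
    int                        nsops;
};

/*
 * want_zero == 0: count pending decrements on semnum (the GETNCNT case).
 * want_zero != 0: count pending wait-for-zero operations (the GETZCNT case).
 */
int count_waiters(const struct model_queue *pending,
                  unsigned short semnum, int want_zero)
{
    int count = 0;

    for (const struct model_queue *q = pending; q != NULL; q = q->next) {
        for (int i = 0; i < q->nsops; i++) {
            const struct model_sembuf *s = &q->sops[i];
            int match = want_zero ? (s->sem_op == 0) : (s->sem_op < 0);

            if (s->sem_num == semnum && match &&
                !(s->sem_flg & MODEL_IPC_NOWAIT))
                count++;
        }
    }
    return count;
}
```

Note that this model counts matching operations, so a single blocked sequence containing two decrements of the same semaphore contributes twice.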

update_queue()

update_queue() traverses the queue of pending semops for a semaphore set and calls try_atomic_semop() to determine which sequences of semaphore operations would succeed. If the status of the queue element indicates that blocked tasks have already been awakened, then the queue element is skipped over. For other elements of the queue, the q->alter flag is passed as the undo parameter to try_atomic_semop(), indicating that any altering operations should be undone before returning.

If the sequence of operations would block, then update_queue() returns without making any changes.

A sequence of operations can fail if one of the semaphore operations would cause an invalid semaphore value, or an operation marked IPC_NOWAIT is unable to complete. In such a case, the task that is blocked on the sequence of semaphore operations is awakened, and the queue status is set with an appropriate error code. The queue element is also dequeued.

If the sequence of operations is non-altering, then they would have passed a zero value as the undo parameter to try_atomic_semop(). If these operations succeeded, then they are considered complete and are removed from the queue. The blocked task is awakened, and the queue element status is set to indicate success.

If the sequence of operations would alter the semaphore values, but can succeed, then sleeping tasks that no longer need to be blocked are awakened. The queue status is set to 1 to indicate that the blocked task has been awakened. The operations have not been performed, so the queue element is not removed from the queue. The semaphore operations would be executed by the awakened task.

try_atomic_semop()

try_atomic_semop() is called by sys_semop() and update_queue() to determine if a sequence of semaphore operations will all succeed. It determines this by attempting to perform each of the operations.

If a blocking operation is encountered, then the process is aborted and all operations are reversed. -EAGAIN is returned if IPC_NOWAIT is set. Otherwise 1 is returned to indicate that the sequence of semaphore operations is blocked.

If a semaphore value is adjusted beyond system limits, then all operations are reversed, and -ERANGE is returned.

If all operations in the sequence succeed, and the do_undo parameter is non-zero, then all operations are reversed, and 0 is returned. If the do_undo parameter is zero, then all operations succeeded and remain in force, and the sem_otime field of the semaphore set is updated.

sem_revalidate()

sem_revalidate() is called when the global semaphores spinlock has been temporarily dropped and needs to be locked again. It is called by semctl_main() and alloc_undo(). It validates the semaphore ID and permissions and on success, returns with the global semaphores spinlock locked.

freeundos()

freeundos() traverses the process undo list in search of the desired undo structure. If found, the undo structure is removed from the list and freed. A pointer to the next undo structure on the process list is returned.

alloc_undo()

alloc_undo() expects to be called with the global semaphores spinlock locked. In the case of an error, it returns with it unlocked.

The global semaphores spinlock is unlocked, and kmalloc() is called to allocate sufficient memory for both the sem_undo structure, and also an array of one adjustment value for each semaphore in the set. On success, the global spinlock is regained with a call to sem_revalidate().

The new sem_undo structure is then initialized, and the address of this structure is placed at the address provided by the caller. The new undo structure is then placed at the head of the undo list for the current task.

sem_exit()

sem_exit() is called by do_exit(), and is responsible for executing all of the undo adjustments for the exiting task.

If the current process was blocked on a semaphore, then it is removed from the sem_queue list while holding the global semaphores spinlock.

The undo list for the current task is then traversed. The global semaphores spinlock is acquired and released around the processing of each element of the list. The following operations are performed for each of the undo elements:

When the processing of the list is complete, the current->semundo value is cleared.

5.2 Message queues

Message System Call Interfaces

sys_msgget()

The entire call to sys_msgget() is protected by the global message queue semaphore (msg_ids.sem).

In the case where a new message queue must be created, the newque() function is called to create and initialize a new message queue, and the new queue ID is returned to the caller.

If a key value is provided for an existing message queue, then ipc_findkey() is called to look up the corresponding index in the global message queue descriptor array (msg_ids.entries). The parameters and permissions of the caller are verified before returning the message queue ID. The look up operation and verification are performed while the global message queue spinlock (msg_ids.ary) is held.

sys_msgctl()

The parameters passed to sys_msgctl() are: a message queue ID (msqid), the operation (cmd), and a pointer to a user space buffer of type msqid_ds (buf). Six operations are provided in this function: IPC_INFO, MSG_INFO, IPC_STAT, MSG_STAT, IPC_SET and IPC_RMID. The message queue ID and the operation parameters are validated; then, the operation (cmd) is performed as follows:

IPC_INFO (or MSG_INFO)

The global message queue information is copied to user space.

IPC_STAT (or MSG_STAT)

A temporary buffer of type struct msqid64_ds is initialized and the global message queue spinlock is locked. After verifying the access permissions of the calling process, the message queue information associated with the message queue ID is loaded into the temporary buffer, the global message queue spinlock is unlocked, and the contents of the temporary buffer are copied out to user space by copy_msqid_to_user().

IPC_SET

The user data is copied in via copy_msqid_from_user(). The global message queue semaphore and spinlock are obtained, and both are released at the end of the operation. After the message queue ID and the current process access permissions are validated, the message queue information is updated with the user provided data. Later, expunge_all() and ss_wakeup() are called to wake up all processes sleeping on the receiver and sender waiting queues of the message queue. This is because some receivers may now be excluded by stricter access permissions and some senders may now be able to send the message due to an increased queue size.

IPC_RMID

The global message queue semaphore is obtained and the global message queue spinlock is locked. After validating the message queue ID and the current task access permissions, freeque() is called to free the resources related to the message queue ID. The global message queue semaphore and spinlock are released.

sys_msgsnd()

sys_msgsnd() receives as parameters a message queue ID (msqid), a pointer to a user buffer of type struct msgbuf (msgp), the size of the message to be sent (msgsz), and a flag indicating wait vs. not wait (msgflg). There are two task waiting queues and one message waiting queue associated with the message queue ID. If there is a task in the receiver waiting queue that is waiting for this message, then the message is delivered directly to the receiver, and the receiver is awakened. Otherwise, if there is enough space available in the message waiting queue, the message is saved in this queue. As a last resort, the sending task enqueues itself on the sender waiting queue. A more in-depth discussion of the operations performed by sys_msgsnd() follows:

  1. Validates the user buffer address and the message type, then invokes load_msg() to load the contents of the user message into a temporary object msg of type struct msg_msg. The message type and message size fields of msg are also initialized.
  2. Locks the global message queue spinlock and gets the message queue descriptor associated with the message queue ID. If no such message queue exists, returns EINVAL.
  3. Invokes ipc_checkid() (via msg_checkid()) to verify that the message queue ID is valid and calls ipcperms() to check the calling process' access permissions.
  4. Checks the message size and the space left in the message waiting queue to see if there is enough room to store the message. If not, the following substeps are performed:
    1. If IPC_NOWAIT is specified in msgflg the global message queue spinlock is unlocked, the memory resources for the message are freed, and EAGAIN is returned.
    2. Invokes ss_add() to enqueue the current task in the sender waiting queue. It also unlocks the global message queue spinlock and invokes schedule() to put the current task to sleep.
    3. When awakened, obtains the global spinlock again and verifies that the message queue ID is still valid. If the message queue ID is not valid, EIDRM is returned.
    4. Invokes ss_del() to remove the sending task from the sender waiting queue. If there is any signal pending for the task, sys_msgsnd() unlocks the global spinlock, invokes free_msg() to free the message buffer, and returns EINTR. Otherwise, the function goes back to check again whether there is enough space in the message waiting queue.
  5. Invokes pipelined_send() to try to send the message to the waiting receiver directly.
  6. If there is no receiver waiting for this message, enqueues msg into the message waiting queue (msq->q_messages). Updates the q_cbytes and the q_qnum fields of the message queue descriptor, as well as the global variables msg_bytes and msg_hdrs, which indicate the total number of bytes used for messages and the total number of messages system wide.
  7. If the message has been successfully sent or enqueued, updates the q_lspid and the q_stime fields of the message queue descriptor and releases the global message queue spinlock.
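The decision sequence in steps 4 through 6 can be condensed into a small model. The sketch below is an illustration under assumptions rather than the kernel logic: enum send_outcome, struct model_msg_queue and model_msgsnd() are hypothetical names, only the byte limit is checked, and sleeping, signals and locking are reduced to a returned outcome.

```c
#include <assert.h>
#include <stddef.h>

enum send_outcome {
    SEND_EAGAIN,      /* no room and IPC_NOWAIT requested */
    SEND_SLEEP,       /* no room: join the sender waiting queue */
    SEND_PIPELINED,   /* handed directly to a waiting receiver */
    SEND_ENQUEUED     /* stored in the message waiting queue */
};

/* Reduced stand-in for the accounting fields of struct msg_queue. */
struct model_msg_queue {
    unsigned long q_cbytes;   /* current number of bytes on queue */
    unsigned long q_qnum;     /* number of messages in queue */
    unsigned long q_qbytes;   /* max number of bytes on queue */
};

enum send_outcome model_msgsnd(struct model_msg_queue *msq, unsigned long msgsz,
                               int nowait, int receiver_waiting)
{
    assert(msq != NULL);

    /* Step 4: is there room for the message? */
    if (msgsz + msq->q_cbytes > msq->q_qbytes)
        return nowait ? SEND_EAGAIN : SEND_SLEEP;

    /* Step 5: a matching receiver takes the message without queueing. */
    if (receiver_waiting)
        return SEND_PIPELINED;

    /* Step 6: queue the message and update the statistics. */
    msq->q_cbytes += msgsz;
    msq->q_qnum++;
    return SEND_ENQUEUED;
}
```

In this model a pipelined send leaves q_cbytes and q_qnum untouched, since the message never sits in the message waiting queue.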

sys_msgrcv()

The sys_msgrcv() function receives as parameters a message queue ID (msqid), a pointer to a user buffer of type struct msgbuf (msgp), the desired message size (msgsz), the message type (msgtyp), and the flags (msgflg). It searches the message waiting queue associated with the message queue ID, finds the first message in the queue which matches the request type, and copies it into the given user buffer. If no such message is found in the message waiting queue, the requesting task is enqueued into the receiver waiting queue until the desired message is available. A more in-depth discussion of the operations performed by sys_msgrcv() follows:

  1. First, invokes convert_mode() to derive the search mode from msgtyp. sys_msgrcv() then locks the global message queue spinlock and obtains the message queue descriptor associated with the message queue ID. If no such message queue exists, it returns EINVAL.
  2. Checks whether the current task has the correct permissions to access the message queue.
  3. Starting from the first message in the message waiting queue, invokes testmsg() to check whether the message type matches the required type. sys_msgrcv() continues searching until a matched message is found or the whole waiting queue is exhausted. If the search mode is SEARCH_LESSEQUAL, then the search continues through the queue in order to find the first message with the lowest type that is less than or equal to msgtyp.
  4. If a message is found, sys_msgrcv() performs the following substeps:
    1. If the message size is larger than the desired size and msgflg indicates no error allowed, unlocks the global message queue spinlock and returns E2BIG.
    2. Removes the message from the message waiting queue and updates the message queue statistics.
    3. Wakes up all tasks sleeping on the senders waiting queue. The removal of a message from the queue in the previous step makes it possible for one of the senders to progress. Goes to the last step.
  5. If no message matching the receiver's criteria is found in the message waiting queue, then msgflg is checked. If IPC_NOWAIT is set, then the global message queue spinlock is unlocked and ENOMSG is returned. Otherwise, the receiver is enqueued on the receiver waiting queue as follows:
    1. A msg_receiver data structure msr is allocated and is added to the head of the waiting queue.
    2. The r_tsk field of msr is set to current task.
    3. The r_msgtype and r_mode fields are initialized with the desired message type and mode respectively.
    4. If msgflg indicates MSG_NOERROR, then the r_maxsize field of msr is set to INT_MAX, since a message of any size may be accepted and truncated; otherwise it is set to the value of msgsz.
    5. The r_msg field is initialized to indicate that no message has been received yet.
    6. After the initialization is complete, the status of the receiving task is set to TASK_INTERRUPTIBLE, the global message queue spinlock is unlocked, and schedule() is invoked.
  6. After the receiver is awakened, the r_msg field of msr is checked. This field is used to store the pipelined message or, in the case of an error, to store the error status. If the r_msg field is filled with the desired message, then go to the last step. Otherwise, the global message queue spinlock is locked again.
  7. After obtaining the spinlock, the r_msg field is re-checked to see if the message was received while waiting for the spinlock. If the message has been received, the last step occurs.
  8. If the r_msg field remains unchanged, then the task was awakened in order to retry. In this case, msr is dequeued. If there is a signal pending for the task, then the global message queue spinlock is unlocked and EINTR is returned. Otherwise, the function needs to go back and retry.
  9. If the r_msg field shows that an error occurred while sleeping, the global message queue spinlock is unlocked and the error is returned.
  10. After validating that the address of the user buffer msgp is valid, the message type is loaded into the mtype field of msgp, and store_msg() is invoked to copy the message contents to the mtext field of msgp. Finally, the memory for the message is freed by free_msg().
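The type matching in steps 1 and 3 can be sketched over a plain array of message types. This is a hedged model, not the kernel code: SEARCH_LESSEQUAL is named in the text above, while the other mode names, the MODEL_MSG_EXCEPT bit and the model_* functions are assumptions introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Search modes; only SEARCH_LESSEQUAL is named in the text, the rest are assumed. */
enum { SEARCH_ANY = 1, SEARCH_EQUAL, SEARCH_NOTEQUAL, SEARCH_LESSEQUAL };

#define MODEL_MSG_EXCEPT 020000   /* hypothetical flag bit standing in for MSG_EXCEPT */

/* Derive the search mode from msgtyp; a negative msgtyp is made positive. */
int model_convert_mode(long *msgtyp, int msgflg)
{
    assert(msgtyp != NULL);
    if (*msgtyp == 0)
        return SEARCH_ANY;
    if (*msgtyp < 0) {
        *msgtyp = -*msgtyp;
        return SEARCH_LESSEQUAL;
    }
    if (msgflg & MODEL_MSG_EXCEPT)
        return SEARCH_NOTEQUAL;
    return SEARCH_EQUAL;
}

/* Does a message of type m_type satisfy the request? */
int model_testmsg(long m_type, long type, int mode)
{
    switch (mode) {
    case SEARCH_ANY:       return 1;
    case SEARCH_LESSEQUAL: return m_type <= type;
    case SEARCH_EQUAL:     return m_type == type;
    case SEARCH_NOTEQUAL:  return m_type != type;
    }
    return 0;
}

/* Returns the index of the selected message in queue order, or -1 if none. */
int model_find_msg(const long *types, int n, long msgtyp, int msgflg)
{
    int mode = model_convert_mode(&msgtyp, msgflg);
    int found = -1;

    for (int i = 0; i < n; i++) {
        if (!model_testmsg(types[i], msgtyp, mode))
            continue;
        found = i;
        if (mode == SEARCH_LESSEQUAL && types[i] != 1)
            msgtyp = types[i] - 1;   /* keep looking for a strictly lower type */
        else
            break;
    }
    return found;
}
```

Lowering msgtyp after each match is what makes the SEARCH_LESSEQUAL scan settle on the earliest message among those with the lowest qualifying type.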

Message Specific Structures

Data structures for message queues are defined in msg.c.

struct msg_queue


/* one msq_queue structure for each present queue on the system */
struct msg_queue {
        struct kern_ipc_perm q_perm;
        time_t q_stime;                 /* last msgsnd time */
        time_t q_rtime;                 /* last msgrcv time */
        time_t q_ctime;                 /* last change time */
        unsigned long q_cbytes;         /* current number of bytes on queue */
        unsigned long q_qnum;           /* number of messages in queue */
        unsigned long q_qbytes;         /* max number of bytes on queue */
        pid_t q_lspid;                  /* pid of last msgsnd */
        pid_t q_lrpid;                  /* last receive pid */

        struct list_head q_messages;
        struct list_head q_receivers;
        struct list_head q_senders;
};

struct msg_msg


/* one msg_msg structure for each message */
struct msg_msg {
        struct list_head m_list;
        long  m_type;
        int m_ts;           /* message text size */
        struct msg_msgseg* next;
        /* the actual message follows immediately */
};

struct msg_msgseg


/* message segment for each message */
struct msg_msgseg {
        struct msg_msgseg* next;
        /* the next part of the message follows immediately */
};

struct msg_sender


/* one msg_sender for each sleeping sender */
struct msg_sender {
        struct list_head list;
        struct task_struct* tsk;
};

struct msg_receiver


/* one msg_receiver structure for each sleeping receiver */
struct msg_receiver {
        struct list_head r_list;
        struct task_struct* r_tsk;

        int r_mode;
        long r_msgtype;
        long r_maxsize;

        struct msg_msg* volatile r_msg;
};

struct msqid64_ds


struct msqid64_ds {
        struct ipc64_perm msg_perm;
        __kernel_time_t msg_stime;      /* last msgsnd time */
        unsigned long   __unused1;
        __kernel_time_t msg_rtime;      /* last msgrcv time */
        unsigned long   __unused2;
        __kernel_time_t msg_ctime;      /* last change time */
        unsigned long   __unused3;
        unsigned long  msg_cbytes;      /* current number of bytes on queue */
        unsigned long  msg_qnum;        /* number of messages in queue */
        unsigned long  msg_qbytes;      /* max number of bytes on queue */
        __kernel_pid_t msg_lspid;       /* pid of last msgsnd */
        __kernel_pid_t msg_lrpid;       /* last receive pid */
        unsigned long  __unused4;
        unsigned long  __unused5;
};

struct msqid_ds


struct msqid_ds {
        struct ipc_perm msg_perm;
        struct msg *msg_first;          /* first message on queue,unused  */
        struct msg *msg_last;           /* last message in queue,unused */
        __kernel_time_t msg_stime;      /* last msgsnd time */
        __kernel_time_t msg_rtime;      /* last msgrcv time */
        __kernel_time_t msg_ctime;      /* last change time */
        unsigned long  msg_lcbytes;     /* Reuse junk fields for 32 bit */
        unsigned long  msg_lqbytes;     /* ditto */
        unsigned short msg_cbytes;      /* current number of bytes on queue */
        unsigned short msg_qnum;        /* number of messages in queue */
        unsigned short msg_qbytes;      /* max number of bytes on queue */
        __kernel_ipc_pid_t msg_lspid;   /* pid of last msgsnd */
        __kernel_ipc_pid_t msg_lrpid;   /* last receive pid */
};

struct msq_setbuf


struct msq_setbuf {
        unsigned long   qbytes;
        uid_t           uid;
        gid_t           gid;
        mode_t          mode;
};

Message Support Functions

newque()

newque() allocates the memory for a new message queue descriptor (struct msg_queue) and then calls ipc_addid(), which reserves a message queue array entry for the new message queue descriptor. The message queue descriptor is initialized as follows:

All the operations following the call to ipc_addid() are performed while holding the global message queue spinlock. After unlocking the spinlock, newque() calls msg_buildid(), which maps directly to ipc_buildid(). ipc_buildid() uses the index of the message queue descriptor to create a unique message queue ID that is then returned to the caller of newque().

freeque()

When a message queue is going to be removed, the freeque() function is called. This function assumes that the global message queue spinlock is already locked by the calling function. It frees all kernel resources associated with that message queue. First, it calls ipc_rmid() (via msg_rmid()) to remove the message queue descriptor from the array of global message queue descriptors. Then it calls expunge_all() to wake up all receivers and ss_wakeup() to wake up all senders sleeping on this message queue. The global message queue spinlock is then released. Finally, all messages stored in this message queue are freed, and the memory for the message queue descriptor is freed.

ss_wakeup()

ss_wakeup() wakes up all the tasks waiting in the given message sender waiting queue. If this function is called by freeque(), then all senders in the queue are dequeued.

ss_add()

ss_add() receives as parameters a message queue descriptor and a message sender data structure. It fills the tsk field of the message sender data structure with the current process, changes the status of the current process to TASK_INTERRUPTIBLE, then inserts the message sender data structure at the head of the sender waiting queue of the given message queue.

ss_del()

If the given message sender data structure (mss) is still in the associated sender waiting queue, then ss_del() removes mss from the queue.

expunge_all()

expunge_all() receives as parameters a message queue descriptor (msq) and an integer value (res) indicating the reason for waking up the receivers. For each sleeping receiver associated with msq, the r_msg field is set to the indicated wakeup reason (res), and the associated receiving task is awakened. This function is called when a message queue is removed or a message control operation has been performed.

load_msg()

When a process sends a message, the sys_msgsnd() function first invokes the load_msg() function to load the message from user space to kernel space. The message is represented in kernel memory as a linked list of data blocks. Associated with the first data block is a msg_msg structure that describes the overall message. The data block associated with the msg_msg structure is limited to a size of DATALEN_MSG. The data block and the structure are allocated in one contiguous memory block that can be as large as one page in memory. If the full message will not fit into this first data block, then additional data blocks are allocated and are organized into a linked list. These additional data blocks are limited to a size of DATALEN_SEG, and each includes an associated msg_msgseg structure. The msg_msgseg structure and the associated data block are allocated in one contiguous memory block that can be as large as one page in memory. This function returns the address of the new msg_msg structure on success.

store_msg()

The store_msg() function is called by sys_msgrcv() to reassemble a received message into the user space buffer provided by the caller. The data described by the msg_msg structure and any msg_msgseg structures are sequentially copied to the user space buffer.

free_msg()

The free_msg() function releases the memory for a message data structure msg_msg, and the message segments.
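The block chaining used by load_msg(), store_msg(), and free_msg() can be illustrated with a small user-space sketch. Everything here is an assumption made for illustration: the sketch_ names are hypothetical, malloc() and memcpy() stand in for the kernel allocator and the user-space copy routines, and the two size limits are deliberately tiny rather than page-derived.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Tiny, made-up limits so the chaining is easy to see. */
#define SKETCH_FIRST_MAX 16u
#define SKETCH_SEG_MAX   24u

struct sketch_seg {                 /* stand-in for msg_msgseg */
    struct sketch_seg *next;        /* data bytes follow the header */
};

struct sketch_msg {                 /* stand-in for msg_msg */
    struct sketch_seg *next;        /* first overflow segment, if any */
    size_t len;                     /* total message length */
};

/* Release the first block and every chained segment. */
static void sketch_free(struct sketch_msg *msg)
{
    struct sketch_seg *seg = msg->next;

    free(msg);
    while (seg != NULL) {
        struct sketch_seg *next = seg->next;
        free(seg);
        seg = next;
    }
}

/* Copy len bytes from src into a first block plus chained segments. */
static struct sketch_msg *sketch_load(const void *src, size_t len)
{
    const char *p = src;
    size_t chunk = len < SKETCH_FIRST_MAX ? len : SKETCH_FIRST_MAX;
    struct sketch_msg *msg;
    struct sketch_seg **link;

    assert(src != NULL);
    msg = malloc(sizeof(*msg) + chunk);
    if (msg == NULL)
        return NULL;
    msg->next = NULL;
    msg->len = len;
    memcpy(msg + 1, p, chunk);      /* header and data share one block */
    p += chunk;
    len -= chunk;
    link = &msg->next;
    while (len > 0) {
        struct sketch_seg *seg;

        chunk = len < SKETCH_SEG_MAX ? len : SKETCH_SEG_MAX;
        seg = malloc(sizeof(*seg) + chunk);
        if (seg == NULL) {
            sketch_free(msg);       /* undo the partial chain */
            return NULL;
        }
        seg->next = NULL;
        memcpy(seg + 1, p, chunk);
        *link = seg;
        link = &seg->next;
        p += chunk;
        len -= chunk;
    }
    return msg;
}

/* Reassemble the message into dst, which must hold msg->len bytes. */
static void sketch_store(void *dst, const struct sketch_msg *msg)
{
    char *p = dst;
    size_t len = msg->len;
    size_t chunk = len < SKETCH_FIRST_MAX ? len : SKETCH_FIRST_MAX;
    const struct sketch_seg *seg;

    memcpy(p, msg + 1, chunk);
    p += chunk;
    len -= chunk;
    for (seg = msg->next; seg != NULL; seg = seg->next) {
        chunk = len < SKETCH_SEG_MAX ? len : SKETCH_SEG_MAX;
        memcpy(p, seg + 1, chunk);
        p += chunk;
        len -= chunk;
    }
}
```

With these limits, a 50-byte message occupies the first block (16 bytes) and two segments (24 and 10 bytes).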

convert_mode()

convert_mode() is called by sys_msgrcv(). It receives as parameters the address of the specified message type (msgtyp) and a flag (msgflg). It returns the search mode to the caller based on the value of msgtyp and msgflg. If the message type is 0, then SEARCH_ANY is returned. If the message type is less than 0, then it is set to its absolute value and SEARCH_LESSEQUAL is returned. If MSG_EXCEPT is specified in msgflg, then SEARCH_NOTEQUAL is returned. Otherwise SEARCH_EQUAL is returned.
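The mode selection described above can be sketched in a few lines of C. The sketch_ names and the flag value are hypothetical stand-ins, not the kernel definitions. The matching helper is an assumption that follows the usual meaning of each mode name, and it shows how a receiver might apply the selected mode to a queued message type.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_mode {
    SKETCH_ANY,
    SKETCH_EQUAL,
    SKETCH_NOTEQUAL,
    SKETCH_LESSEQUAL
};

#define SKETCH_MSG_EXCEPT 020000    /* assumed flag value, illustration only */

/* Pick a search mode; a negative type is replaced by its absolute value. */
static enum sketch_mode sketch_convert_mode(long *msgtyp, int msgflg)
{
    assert(msgtyp != NULL);
    if (*msgtyp == 0)
        return SKETCH_ANY;
    if (*msgtyp < 0) {
        *msgtyp = -*msgtyp;
        return SKETCH_LESSEQUAL;
    }
    if (msgflg & SKETCH_MSG_EXCEPT)
        return SKETCH_NOTEQUAL;
    return SKETCH_EQUAL;
}

/* Return 1 if a queued message of type m_type satisfies the request. */
static int sketch_match(long m_type, long type, enum sketch_mode mode)
{
    switch (mode) {
    case SKETCH_ANY:
        return 1;
    case SKETCH_LESSEQUAL:
        return m_type <= type;
    case SKETCH_EQUAL:
        return m_type == type;
    case SKETCH_NOTEQUAL:
        return m_type != type;
    }
    return 0;
}
```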

testmsg()

The testmsg() function checks whether a message meets the criteria specified by the receiver. It returns 1 if one of the following conditions is true:

pipelined_send()

pipelined_send() allows a process to directly send a message to a waiting receiver rather than deposit the message in the associated message waiting queue. The testmsg() function is invoked to find the first receiver which is waiting for the given message. If found, the waiting receiver is removed from the receiver waiting queue, and the associated receiving task is awakened. The message is stored in the r_msg field of the receiver, and 1 is returned. In the case where no receiver is waiting for the message, 0 is returned.

In the process of searching for a receiver, potential receivers may be found which have requested a size that is too small for the given message. Such receivers are removed from the queue, and are awakened with an error status of E2BIG, which is stored in the r_msg field. The search then continues until either a valid receiver is found, or the queue is exhausted.
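A simplified sketch of this search follows. It is not the kernel code: the receiver queue is modeled as a plain array, the type matching is reduced to "exact type, or 0 for any", the waiting flag stands in for dequeueing and waking the task, and the error status is kept in a separate r_err field rather than encoded in r_msg. All sketch_ names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sketch_receiver {
    long want_type;     /* type wanted; 0 accepts any (simplified) */
    size_t maxsize;     /* largest message the receiver will accept */
    int waiting;        /* 1 while still on the receiver waiting queue */
    const void *r_msg;  /* message handed over directly, if any */
    int r_err;          /* error status for a rejected receiver */
};

/* Returns 1 if the message was handed to a receiver, 0 otherwise. */
static int sketch_pipelined_send(struct sketch_receiver *q, size_t n,
                                 long type, size_t size, const void *msg)
{
    size_t i;

    assert(q != NULL || n == 0);
    for (i = 0; i < n; i++) {
        struct sketch_receiver *r = &q[i];

        if (!r->waiting)
            continue;
        if (r->want_type != 0 && r->want_type != type)
            continue;               /* not waiting for this message */
        r->waiting = 0;             /* dequeue and wake the receiver */
        if (size > r->maxsize) {
            r->r_err = E2BIG;       /* buffer too small: keep searching */
            continue;
        }
        r->r_msg = msg;
        return 1;
    }
    return 0;                       /* caller must enqueue the message */
}
```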

copy_msqid_to_user()

copy_msqid_to_user() copies the contents of a kernel buffer to the user buffer. It receives as parameters a user buffer, a kernel buffer of type msqid64_ds, and a version flag indicating the new IPC version vs. the old IPC version. If the version flag equals IPC_64, then copy_to_user() is invoked to copy from the kernel buffer to the user buffer directly. Otherwise a temporary buffer of type struct msqid_ds is initialized, and the kernel data is translated to this temporary buffer. copy_to_user() is then called to copy the contents of the temporary buffer to the user buffer.

copy_msqid_from_user()

The function copy_msqid_from_user() receives as parameters a kernel message buffer of type struct msq_setbuf, a user buffer, and a version flag indicating the new IPC version vs. the old IPC version. In the case of the new IPC version, copy_from_user() is called to copy the contents of the user buffer to a temporary buffer of type msqid64_ds. Then, the qbytes, uid, gid, and mode fields of the kernel buffer are filled with the values of the corresponding fields from the temporary buffer. In the case of the old IPC version, a temporary buffer of type struct msqid_ds is used instead.

5.3 Shared Memory

Shared Memory System Call Interfaces

sys_shmget()

The entire call to sys_shmget() is protected by the global shared memory semaphore.

In the case where a new shared memory segment must be created, the newseg() function is called to create and initialize a new shared memory segment. The ID of the new segment is returned to the caller.

In the case where a key value is provided for an existing shared memory segment, the corresponding index in the shared memory descriptors array is looked up, and the parameters and permissions of the caller are verified before returning the shared memory segment ID. The look up operation and verification are performed while the global shared memory spinlock is held.

sys_shmctl()

IPC_INFO

A temporary shminfo64 buffer is loaded with system-wide shared memory parameters and is copied out to user space for access by the calling application.

SHM_INFO

The global shared memory semaphore and the global shared memory spinlock are held while gathering system-wide statistical information for shared memory. The shm_get_stat() function is called to calculate both the number of shared memory pages that are resident in memory and the number of shared memory pages that are swapped out. Other statistics include the total number of shared memory pages and the number of shared memory segments in use. The counts of swap_attempts and swap_successes are hard-coded to zero. These statistics are stored in a temporary shm_info buffer and copied out to user space for the calling application.

SHM_STAT, IPC_STAT

For SHM_STAT and IPC_STAT, a temporary buffer of type struct shmid64_ds is initialized, and the global shared memory spinlock is locked.

For the SHM_STAT case, the shared memory segment ID parameter is expected to be a straight index (i.e. 0 to n where n is the number of shared memory IDs in the system). After validating the index, ipc_buildid() is called (via shm_buildid()) to convert the index into a shared memory ID. In the passing case of SHM_STAT, the shared memory ID will be the return value. Note that this is an undocumented feature, but is maintained for the ipcs(8) program.

For the IPC_STAT case, the shared memory segment ID parameter is expected to be an ID that was generated by a call to shmget(). The ID is validated before proceeding. In the passing case of IPC_STAT, 0 will be the return value.

For both SHM_STAT and IPC_STAT, the access permissions of the caller are verified. The desired statistics are loaded into the temporary buffer and then copied out to the calling application.

SHM_LOCK, SHM_UNLOCK

After validating access permissions, the global shared memory spinlock is locked, and the shared memory segment ID is validated. For both SHM_LOCK and SHM_UNLOCK, shmem_lock() is called to perform the function. The parameters for shmem_lock() identify the function to be performed.

IPC_RMID

The global shared memory semaphore and the global shared memory spinlock are held throughout the IPC_RMID operation. The shared memory ID is validated, and then if there are no current attachments, shm_destroy() is called to destroy the shared memory segment. Otherwise, the SHM_DEST flag is set to mark it for destruction, and the IPC_PRIVATE flag is set to prevent other processes from being able to reference the shared memory ID.

IPC_SET

After validating the shared memory segment ID and the user access permissions, the uid, gid, and mode flags of the shared memory segment are updated with the user data. The shm_ctime field is also updated. These changes are made while holding the global shared memory semaphore and the global shared memory spinlock.

sys_shmat()

sys_shmat() takes as parameters a shared memory segment ID, an address at which the shared memory segment should be attached (shmaddr), and flags which will be described below.

If shmaddr is non-zero, and the SHM_RND flag is specified, then shmaddr is rounded down to a multiple of SHMLBA. If shmaddr is not a multiple of SHMLBA and SHM_RND is not specified, then EINVAL is returned.

The access permissions of the caller are validated and the shm_nattch field for the shared memory segment is incremented. Note that this increment guarantees that the attachment count is non-zero and prevents the shared memory segment from being destroyed during the process of attaching to the segment. These operations are performed while holding the global shared memory spinlock.

The do_mmap() function is called to create a virtual memory mapping to the shared memory segment pages. This is done while holding the mmap_sem semaphore of the current task. The MAP_SHARED flag is passed to do_mmap(). If an address was provided by the caller, then the MAP_FIXED flag is also passed to do_mmap(). Otherwise, do_mmap() will select the virtual address at which to map the shared memory segment.

Note that shm_inc() will be invoked within the do_mmap() function call via the shm_file_operations structure. This function is called to set the PID, to set the current time, and to increment the number of attachments to this shared memory segment.

After the call to do_mmap(), the global shared memory semaphore and the global shared memory spinlock are both obtained. The attachment count is then decremented. The net change to the attachment count is 1 for a call to shmat() because of the call to shm_inc(). If, after decrementing the attachment count, the resulting count is found to be zero, and if the segment is marked for destruction (SHM_DEST), then shm_destroy() is called to release the shared memory segment resources.

Finally, the virtual address at which the shared memory is mapped is returned to the caller at the user specified address. If an error code had been returned by do_mmap(), then this failure code is passed on as the return value for the system call.

sys_shmdt()

The global shared memory semaphore is held while performing sys_shmdt(). The mm_struct of the current process is searched for the vm_area_struct associated with the shared memory address. When it is found, do_munmap() is called to undo the virtual address mapping for the shared memory segment.

Note also that do_munmap() performs a call-back to shm_close(), which performs the shared memory bookkeeping functions, and releases the shared memory segment resources if there are no other attachments.

sys_shmdt() unconditionally returns 0.

Shared Memory Support Structures

struct shminfo64


struct shminfo64 {
        unsigned long   shmmax;
        unsigned long   shmmin;
        unsigned long   shmmni;
        unsigned long   shmseg;
        unsigned long   shmall;
        unsigned long   __unused1;
        unsigned long   __unused2;
        unsigned long   __unused3;
        unsigned long   __unused4;
};

struct shm_info


struct shm_info {
        int used_ids;
        unsigned long shm_tot;  /* total allocated shm */
        unsigned long shm_rss;  /* total resident shm */
        unsigned long shm_swp;  /* total swapped shm */
        unsigned long swap_attempts;
        unsigned long swap_successes;
};

struct shmid_kernel


struct shmid_kernel /* private to the kernel */
{
        struct kern_ipc_perm    shm_perm;
        struct file *           shm_file;
        int                     id;
        unsigned long           shm_nattch;
        unsigned long           shm_segsz;
        time_t                  shm_atim;
        time_t                  shm_dtim;
        time_t                  shm_ctim;
        pid_t                   shm_cprid;
        pid_t                   shm_lprid;
};

struct shmid64_ds


struct shmid64_ds {
        struct ipc64_perm       shm_perm;       /* operation perms */
        size_t                  shm_segsz;      /* size of segment (bytes) */
        __kernel_time_t         shm_atime;      /* last attach time */
        unsigned long           __unused1;
        __kernel_time_t         shm_dtime;      /* last detach time */
        unsigned long           __unused2;
        __kernel_time_t         shm_ctime;      /* last change time */
        unsigned long           __unused3;
        __kernel_pid_t          shm_cpid;       /* pid of creator */
        __kernel_pid_t          shm_lpid;       /* pid of last operator */
        unsigned long           shm_nattch;     /* no. of current attaches */
        unsigned long           __unused4;
        unsigned long           __unused5;
};

struct shmem_inode_info


struct shmem_inode_info {
        spinlock_t      lock;
        unsigned long   max_index;
        swp_entry_t     i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
        swp_entry_t   **i_indirect; /* doubly indirect blocks */
        unsigned long   swapped;
        int             locked;     /* into memory */
        struct list_head        list;
};

Shared Memory Support Functions

newseg()

The newseg() function is called when a new shared memory segment needs to be created. It acts on three parameters for the new segment: the key, the flag, and the size. After validating that the size of the shared memory segment to be created is between SHMMIN and SHMMAX and that the total number of shared memory pages does not exceed SHMALL, it allocates a new shared memory segment descriptor. The shmem_file_setup() function is then invoked to create an unlinked file of type tmpfs. The returned file pointer is saved in the shm_file field of the associated shared memory segment descriptor. The file's size is set to be the same as the size of the segment. The new shared memory segment descriptor is initialized and inserted into the global IPC shared memory descriptors array. The shared memory segment ID is created by shm_buildid() (via ipc_buildid()). This segment ID is saved in the id field of the shared memory segment descriptor, as well as in the i_ino field of the associated inode. In addition, the address of the shared memory operations defined in structure shm_file_operations is stored in the associated file. The value of the global variable shm_tot, which indicates the total number of shared memory pages system wide, is also increased to reflect this change. On success, the segment ID is returned to the calling application.

shm_get_stat()

shm_get_stat() cycles through all of the shared memory structures, and calculates the total number of memory pages in use by shared memory and the total number of shared memory pages that are swapped out. There is a file structure and an inode structure for each shared memory segment. Since the required data is obtained via the inode, the spinlock for each inode structure that is accessed is locked and unlocked in sequence.

shmem_lock()

shmem_lock() receives as parameters a pointer to the shared memory segment descriptor and a flag indicating lock vs. unlock. The locking state of the shared memory segment is stored in an associated inode. This state is compared with the desired locking state; shmem_lock() simply returns if they match.

While holding the semaphore of the associated inode, the locking state of the inode is set. The following list of items occur for each page in the shared memory segment:

shm_destroy()

During shm_destroy() the total number of shared memory pages is adjusted to account for the removal of the shared memory segment. ipc_rmid() is called (via shm_rmid()) to remove the shared memory ID. shmem_lock() is called to unlock the shared memory pages, effectively decrementing the reference counts to zero for each page. fput() is called to decrement the usage counter f_count for the associated file object, and if necessary, to release the file object resources. kfree() is called to free the shared memory segment descriptor.

shm_inc()

shm_inc() sets the PID, sets the current time, and increments the number of attachments for the given shared memory segment. These operations are performed while holding the global shared memory spinlock.

shm_close()

shm_close() updates the shm_lprid and the shm_dtim fields and decrements the number of attachments to the shared memory segment. If there are no other attachments to the shared memory segment, then shm_destroy() is called to release the shared memory segment resources. These operations are all performed while holding both the global shared memory semaphore and the global shared memory spinlock.

shmem_file_setup()

The function shmem_file_setup() sets up an unlinked file living in the tmpfs file system with the given name and size. If there are enough system memory resources for this file, it creates a new dentry under the mount root of tmpfs, and allocates a new file descriptor and a new inode object of tmpfs type. Then it associates the new dentry object with the new inode object by calling d_instantiate() and saves the address of the dentry object in the file descriptor. The i_size field of the inode object is set to be the file size and the i_nlink field is set to be 0 in order to mark the inode unlinked. Also, shmem_file_setup() stores the address of the shmem_file_operations structure in the f_op field, and initializes the f_mode and f_vfsmnt fields of the file descriptor properly. The function shmem_truncate() is called to complete the initialization of the inode object. On success, shmem_file_setup() returns the new file descriptor.

5.4 Linux IPC Primitives

Generic Linux IPC Primitives used with Semaphores, Messages, and Shared Memory

The semaphores, messages, and shared memory mechanisms of Linux are built on a set of common primitives. These primitives are described in the sections below.

ipc_alloc()

If the memory allocation is greater than PAGE_SIZE, then vmalloc() is used to allocate memory. Otherwise, kmalloc() is called with GFP_KERNEL to allocate the memory.

ipc_addid()

When a new semaphore set, message queue, or shared memory segment is added, ipc_addid() first calls grow_ary() to ensure that the size of the corresponding descriptor array is sufficiently large for the system maximum. The array of descriptors is searched for the first unused element. If an unused element is found, the count of descriptors which are in use is incremented. The kern_ipc_perm structure for the new resource descriptor is then initialized, and the array index for the new descriptor is returned. When ipc_addid() succeeds, it returns with the global spinlock for the given IPC type locked.
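The slot search can be sketched in user space as shown below. The structures are cut-down, hypothetical versions of the generic IPC structures, locking and the grow_ary() call are left out, and the wrap-around of the sequence number at seq_max is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_perm {
    int key;
    unsigned long seq;
};

struct sketch_ids {
    int size;                       /* capacity of entries[] */
    int in_use;                     /* number of descriptors in use */
    int max_id;                     /* highest index in use */
    unsigned short seq;             /* next sequence number */
    unsigned short seq_max;
    struct sketch_perm **entries;
};

/* Returns the index of the new descriptor, or -1 if the array is full. */
static int sketch_addid(struct sketch_ids *ids, struct sketch_perm *new_perm)
{
    int id;

    assert(ids != NULL && new_perm != NULL);
    for (id = 0; id < ids->size; id++) {
        if (ids->entries[id] == NULL)
            break;                  /* first unused element */
    }
    if (id == ids->size)
        return -1;

    ids->in_use++;
    if (id > ids->max_id)
        ids->max_id = id;
    new_perm->seq = ids->seq++;     /* remembered for stale-ID detection */
    if (ids->seq > ids->seq_max)
        ids->seq = 0;               /* assumed wrap-around behavior */
    ids->entries[id] = new_perm;
    return id;
}
```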

ipc_rmid()

ipc_rmid() removes the IPC descriptor from the global descriptor array of the IPC type, updates the count of IDs which are in use, and adjusts the maximum ID in the corresponding descriptor array if necessary. A pointer to the IPC descriptor associated with the given IPC ID is returned.

ipc_buildid()

ipc_buildid() creates a unique ID to be associated with each descriptor within a given IPC type. This ID is created at the time a new IPC element is added (e.g. a new shared memory segment or a new semaphore set). The IPC ID converts easily into the corresponding descriptor array index. Each IPC type maintains a sequence number which is incremented each time a descriptor is added. An ID is created by multiplying the sequence number with SEQ_MULTIPLIER and adding the product to the descriptor array index. The sequence number used in creating a particular IPC ID is then stored in the corresponding descriptor. The existence of the sequence number makes it possible to detect the use of a stale IPC ID.

ipc_checkid()

ipc_checkid() divides the given IPC ID by the SEQ_MULTIPLIER and compares the quotient with the seq value saved in the corresponding descriptor. If they are equal, then the IPC ID is considered to be valid and 1 is returned. Otherwise, 0 is returned.
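The arithmetic behind ID creation and validation can be written out as a short sketch. The multiplier value below is assumed for illustration, and the sketch_ helpers are hypothetical names whose return conventions belong to the sketch only.

```c
#include <assert.h>

#define SKETCH_SEQ_MULTIPLIER 32768     /* assumed value, illustration only */

/* ID = sequence number * multiplier + descriptor array index. */
static int sketch_buildid(int index, int seq)
{
    assert(index >= 0 && index < SKETCH_SEQ_MULTIPLIER);
    return SKETCH_SEQ_MULTIPLIER * seq + index;
}

/* Recover the descriptor array index from an ID. */
static int sketch_id_to_index(int id)
{
    return id % SKETCH_SEQ_MULTIPLIER;
}

/* Returns 1 if the ID carries the sequence number stored in the descriptor. */
static int sketch_id_is_current(int id, int stored_seq)
{
    return id / SKETCH_SEQ_MULTIPLIER == stored_seq;
}
```

If the slot at index 5 is freed and later reused with a different sequence number, an ID built with the old sequence number still maps to index 5 but no longer passes the sequence check, which is how a stale ID is detected.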

grow_ary()

grow_ary() handles the possibility that the maximum (tunable) number of IDs for a given IPC type can be dynamically changed. It enforces the current maximum limit so that it is no greater than the permanent system limit (IPCMNI) and adjusts it down if necessary. It also ensures that the existing descriptor array is large enough. If the existing array size is sufficiently large, then the current maximum limit is returned. Otherwise, a new larger array is allocated, the old array is copied into the new array, and the old array is freed. The corresponding global spinlock is held when updating the descriptor array for the given IPC type.

ipc_findkey()

ipc_findkey() searches the descriptor array of the specified ipc_ids object for the specified key. If the key is found, the index of the corresponding descriptor is returned. If the key is not found, then -1 is returned.
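A minimal sketch of such a key lookup, assuming a descriptor array of pointers in which unused slots are NULL. The sketch_ types are hypothetical and much smaller than the real structures.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_perm {
    int key;
};

/* Returns the index of the descriptor holding key, or -1 if none does. */
static int sketch_findkey(struct sketch_perm *const *entries, int max_id,
                          int key)
{
    int id;

    assert(entries != NULL);
    for (id = 0; id <= max_id; id++) {
        const struct sketch_perm *p = entries[id];

        if (p == NULL)
            continue;               /* unused slot */
        if (p->key == key)
            return id;
    }
    return -1;
}
```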

ipcperms()

ipcperms() checks the user, group, and other permissions for access to the IPC resources. It returns 0 if permission is granted and -1 otherwise.
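One way to picture the user, group, and other check is the sketch below. It is a simplification under stated assumptions: the caller's identity is passed in explicitly as euid and egid, supplementary groups and capability overrides are ignored, and the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_ipc_perm {
    unsigned uid, gid;      /* current owner */
    unsigned cuid, cgid;    /* creator */
    unsigned short mode;    /* rwx bits for user, group, other */
};

/* Returns 0 if every requested access bit is granted, -1 otherwise. */
static int sketch_ipcperms(const struct sketch_ipc_perm *p,
                           unsigned euid, unsigned egid, unsigned short flag)
{
    unsigned requested, granted;

    assert(p != NULL);
    /* Fold the user/group/other request bits into the low three bits. */
    requested = ((flag >> 6) | (flag >> 3) | flag) & 07;
    granted = p->mode;
    if (euid == p->cuid || euid == p->uid)
        granted >>= 6;      /* use the user bits */
    else if (egid == p->cgid || egid == p->gid)
        granted >>= 3;      /* use the group bits */
    /* Otherwise the "other" bits are already in the low three bits. */

    return (requested & ~granted & 07) ? -1 : 0;
}
```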

ipc_lock()

ipc_lock() takes an IPC ID as one of its parameters. It locks the global spinlock for the given IPC type, and returns a pointer to the descriptor corresponding to the specified IPC ID.

ipc_unlock()

ipc_unlock() releases the global spinlock for the indicated IPC type.

ipc_lockall()

ipc_lockall() locks the global spinlock for the given IPC mechanism (i.e. shared memory, semaphores, and messaging).

ipc_unlockall()

ipc_unlockall() unlocks the global spinlock for the given IPC mechanism (i.e. shared memory, semaphores, and messaging).

ipc_get()

ipc_get() takes a pointer to a particular IPC type (i.e. shared memory, semaphores, or message queues) and a descriptor ID, and returns a pointer to the corresponding IPC descriptor. Note that although the descriptors for each IPC type are of different data types, the common kern_ipc_perm structure type is embedded as the first entity in every case. The ipc_get() function returns this common data type. The expected model is that ipc_get() is called through a wrapper function (e.g. shm_get()) which casts the data type to the correct descriptor data type.
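The wrapper pattern relies on the common structure being the first member of each descriptor, so a pointer to one is also a valid pointer to the other. The following sketch uses hypothetical, cut-down types and an assumed multiplier value to show the idea.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SEQ_MULTIPLIER 32768     /* assumed value, illustration only */

struct sketch_perm {                    /* common part */
    int key;
    unsigned long seq;
};

struct sketch_shm {                     /* one specific descriptor type */
    struct sketch_perm perm;            /* must be the first member */
    unsigned long segsz;
};

struct sketch_ids {
    int size;
    struct sketch_perm **entries;
};

/* Generic lookup: returns the common part, or NULL if out of range. */
static struct sketch_perm *sketch_ipc_get(struct sketch_ids *ids, int id)
{
    int index;

    assert(ids != NULL && id >= 0);
    index = id % SKETCH_SEQ_MULTIPLIER;
    if (index >= ids->size)
        return NULL;
    return ids->entries[index];
}

/* Type-specific wrapper: the cast is valid because perm comes first. */
static struct sketch_shm *sketch_shm_get(struct sketch_ids *ids, int id)
{
    return (struct sketch_shm *)sketch_ipc_get(ids, id);
}
```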

ipc_parse_version()

ipc_parse_version() removes the IPC_64 flag from the command if it is present and returns either IPC_64 or IPC_OLD.
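As a sketch, stripping a version flag from a command word looks like this. The SKETCH_ constants are assumed values chosen for illustration and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_IPC_OLD 0
#define SKETCH_IPC_64  0x0100   /* assumed flag bit, illustration only */

/* Clears the version flag in *cmd and reports which version was requested. */
static int sketch_parse_version(int *cmd)
{
    assert(cmd != NULL);
    if (*cmd & SKETCH_IPC_64) {
        *cmd &= ~SKETCH_IPC_64;
        return SKETCH_IPC_64;
    }
    return SKETCH_IPC_OLD;
}
```

After the call, the command can be compared directly against the plain command values regardless of which version the caller requested.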

Generic IPC Structures used with Semaphores, Messages, and Shared Memory

The semaphores, messages, and shared memory mechanisms all make use of the following common structures:

struct kern_ipc_perm

Each of the IPC descriptors has a data object of this type as the first element. This makes it possible to access any descriptor from any of the generic IPC functions using a pointer of this data type.


/* used by in-kernel data structures */
struct kern_ipc_perm {
    key_t key;
    uid_t uid;
    gid_t gid;
    uid_t cuid;
    gid_t cgid;
    mode_t mode;
    unsigned long seq;
};

struct ipc_ids

The ipc_ids structure describes the common data for semaphores, message queues, and shared memory. There are three global instances of this data structure (sem_ids, msg_ids, and shm_ids) for semaphores, messages, and shared memory respectively. In each instance, the sem semaphore is used to protect access to the structure. The entries field points to an IPC descriptor array, and the ary spinlock protects access to this array. The seq field is a global sequence number which will be incremented when a new IPC resource is created.


struct ipc_ids {
    int size;
    int in_use;
    int max_id;
    unsigned short seq;
    unsigned short seq_max;
    struct semaphore sem;
    spinlock_t ary;
    struct ipc_id* entries;
};

struct ipc_id

An array of struct ipc_id exists in each instance of the ipc_ids structure. The array is dynamically allocated and may be replaced with a larger array by grow_ary() as required. The array is sometimes referred to as the descriptor array, since the kern_ipc_perm data type is used as the common descriptor data type by the IPC generic functions.


struct ipc_id {
    struct kern_ipc_perm* p;
};

