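While reading some material recently, I came across the claim that poll has no limit on the number of file descriptors (fds) because it is implemented with a linked list.
But when I was reading UNIX Network Programming, Volume 1, I found that it describes the fd set passed to poll as an array:
[Image: poll]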


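Why do some sources say linked list while others say array? The accounts don't agree. So I quickly opened the manual pages on my own machine to check.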

First I ran man 2 select to look at select:
[Image: select]

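select takes fd_set arguments, and fd_set is a fixed-size structure (in glibc, a bit array) whose capacity is capped by FD_SETSIZE (the 1024 that most material quotes):
[Image: select]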
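To make the FD_SETSIZE cap concrete, here is a minimal sketch. The helper names fd_set_capacity, fits_in_fd_set and set_and_check are made up for illustration; only fd_set, FD_SETSIZE and the FD_* macros come from <sys/select.h>.

```c
#include <assert.h>
#include <sys/select.h>

/* Capacity of an fd_set: a compile-time constant, not a runtime setting. */
static int fd_set_capacity(void)
{
    return FD_SETSIZE;
}

/* Returns 1 if fd can legally be stored in an fd_set, 0 otherwise. */
static int fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Marks fd in a fresh set and reports whether FD_ISSET sees it.
 * Out-of-range descriptors are rejected instead of being passed to
 * FD_SET, because that would be undefined behavior. */
static int set_and_check(int fd)
{
    fd_set set;

    FD_ZERO(&set);
    if (!fits_in_fd_set(fd))
        return 0;
    FD_SET(fd, &set);
    return FD_ISSET(fd, &set) ? 1 : 0;
}
```

The point is that the limit is baked into the type itself: a descriptor numbered FD_SETSIZE or higher simply has no slot in the set, no matter how many files the process is allowed to open.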

Then I ran man 2 poll to check poll:
[Image: poll]

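The description states plainly that fds is an array of struct pollfd ("The set of file descriptors to be monitored is specified in the fds argument, which is an array of structures of the following form"). So the description in UNIX Network Programming, Volume 1 looks perfectly fine. But then why do so many sources say that poll's fds are based on a linked list? Are they all wrong?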
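For reference, this is roughly what the array interface looks like from user space. It is only a sketch: poll_pipe_demo is a hypothetical helper that watches both ends of a pipe, while poll, pipe and struct pollfd are the real POSIX interfaces.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Polls both ends of a pipe using a two-element pollfd array.
 * Returns a bitmask: bit 0 is set if the read end reported POLLIN,
 * bit 1 is set if the write end reported POLLOUT.
 * Returns -1 if the pipe, the write, or poll itself fails. */
static int poll_pipe_demo(void)
{
    int p[2];
    int result = -1;

    if (pipe(p) != 0)
        return -1;

    /* Put one byte in the pipe so the read end has data waiting. */
    if (write(p[1], "x", 1) == 1) {
        struct pollfd fds[2] = {
            { .fd = p[0], .events = POLLIN,  .revents = 0 },
            { .fd = p[1], .events = POLLOUT, .revents = 0 },
        };

        /* The caller passes the array AND its length; timeout 0 = no wait. */
        if (poll(fds, 2, 0) >= 0)
            result = ((fds[0].revents & POLLIN)  ? 1 : 0) |
                     ((fds[1].revents & POLLOUT) ? 2 : 0);
    }

    close(p[0]);
    close(p[1]);
    return result;
}
```

Note that nothing in the call fixes the array size in advance: the second argument tells the kernel how many entries to look at.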

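So I hurried off to ask an expert.
Me: Hey, there's something about Linux I don't understand.
Expert: Ah, go read the source.
Me: What if I can't follow it?
Expert: If you never read it, you'll never follow it.
Me: Emmmm, that actually makes sense.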

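So, with no other choice, I gritted my teeth and dug into the Linux kernel source (version linux-5.6.12). Searching for sys_poll led me to /fs/select.c (the old sys_xxx functions are now defined through the SYSCALL_DEFINE macros, so it took me a while to find). The code is as follows: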

SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
        int, timeout_msecs)
{
    struct timespec64 end_time, *to = NULL;
    int ret;

    if (timeout_msecs >= 0) {
        to = &end_time;
        poll_select_set_timeout(to, timeout_msecs / MSEC_PER_SEC,
            NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC));
    }

    ret = do_sys_poll(ufds, nfds, to);

    if (ret == -ERESTARTNOHAND) {
        struct restart_block *restart_block;

        restart_block = &current->restart_block;
        restart_block->fn = do_restart_poll;
        restart_block->poll.ufds = ufds;
        restart_block->poll.nfds = nfds;

        if (timeout_msecs >= 0) {
            restart_block->poll.tv_sec = end_time.tv_sec;
            restart_block->poll.tv_nsec = end_time.tv_nsec;
            restart_block->poll.has_timeout = 1;
        } else
            restart_block->poll.has_timeout = 0;

        ret = -ERESTART_RESTARTBLOCK;
    }
    return ret;
}

The line that actually does the polling is ret = do_sys_poll(ufds, nfds, to). Jumping to do_sys_poll:
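As a side note, the timeout arithmetic in the listing above splits a millisecond count into whole seconds plus leftover nanoseconds. A user-space sketch of just that arithmetic follows; struct split_timeout and split_msecs are hypothetical names, and the two constants are redefined locally with the values the kernel macros of the same name have.

```c
#include <assert.h>

#define MSEC_PER_SEC  1000L
#define NSEC_PER_MSEC 1000000L

/* Hypothetical result type: seconds plus nanoseconds. */
struct split_timeout {
    long long sec;
    long nsec;
};

/* Splits a non-negative millisecond timeout the same way the syscall
 * entry does before calling poll_select_set_timeout. Negative values
 * mean "no timeout" in poll and are not handled here. */
static struct split_timeout split_msecs(int timeout_msecs)
{
    struct split_timeout t;

    assert(timeout_msecs >= 0);
    t.sec = timeout_msecs / MSEC_PER_SEC;
    t.nsec = NSEC_PER_MSEC * (timeout_msecs % MSEC_PER_SEC);
    return t;
}
```

For example, a timeout of 2500 ms becomes 2 seconds and 500000000 nanoseconds.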

static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
        struct timespec64 *end_time)
{
    struct poll_wqueues table;
    int err = -EFAULT, fdcount, len;
    /* Allocate small arguments on the stack to save memory and be
       faster - use long to make sure the buffer is aligned properly
       on 64 bit archs to avoid unaligned access */
    long stack_pps[POLL_STACK_ALLOC/sizeof(long)];
    struct poll_list *const head = (struct poll_list *)stack_pps;
    struct poll_list *walk = head;
    unsigned long todo = nfds;

    if (nfds > rlimit(RLIMIT_NOFILE))
        return -EINVAL;

    len = min_t(unsigned int, nfds, N_STACK_PPS);
    for (;;) {
        walk->next = NULL;
        walk->len = len;
        if (!len)
            break;

        if (copy_from_user(walk->entries, ufds + nfds-todo,
                    sizeof(struct pollfd) * walk->len))
            goto out_fds;

        todo -= walk->len;
        if (!todo)
            break;

        len = min(todo, POLLFD_PER_PAGE);
        walk = walk->next = kmalloc(struct_size(walk, entries, len),
                        GFP_KERNEL);
        if (!walk) {
            err = -ENOMEM;
            goto out_fds;
        }
    }

    poll_initwait(&table);
    fdcount = do_poll(head, &table, end_time);
    poll_freewait(&table);

    for (walk = head; walk; walk = walk->next) {
        struct pollfd *fds = walk->entries;
        int j;

        for (j = 0; j < walk->len; j++, ufds++)
            if (__put_user(fds[j].revents, &ufds->revents))
                goto out_fds;
    }

    err = fdcount;
out_fds:
    walk = head->next;
    while (walk) {
        struct poll_list *pos = walk;
        walk = walk->next;
        kfree(pos);
    }

    return err;
}

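I was stunned: there really is a linked list of struct poll_list nodes here, with head as its first node and walk as the cursor used to build and traverse it. The list lives in kernel space: the first node sits in the on-stack buffer stack_pps, and further nodes are allocated with kmalloc only when needed. Each node holds a run of pollfd entries in its entries array, and copy_from_user fills those entries from the user-space pollfd array we passed in. The total number of file descriptors to process, nfds, is supplied by the caller, and what do_poll finally walks is the kernel-space list starting at head.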
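To check my understanding, here is a user-space sketch of the same idea: copying a flat pollfd array into a linked list of fixed-capacity chunks. This is not kernel code. struct chunk, FIRST_CHUNK, CHUNK_MAX and all function names are invented for illustration (the two constants stand in for N_STACK_PPS and POLLFD_PER_PAGE with tiny values), malloc and memcpy replace kmalloc and copy_from_user, and unlike the kernel the first node is heap-allocated rather than placed on the stack.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <string.h>

#define FIRST_CHUNK 4  /* hypothetical stand-in for N_STACK_PPS */
#define CHUNK_MAX   8  /* hypothetical stand-in for POLLFD_PER_PAGE */

/* One list node holding a run of pollfd entries, like struct poll_list. */
struct chunk {
    struct chunk *next;
    size_t len;
    struct pollfd entries[];  /* flexible array member */
};

static struct chunk *chunk_alloc(size_t len)
{
    struct chunk *c = malloc(sizeof(*c) + len * sizeof(struct pollfd));

    if (c) {
        c->next = NULL;
        c->len = len;
    }
    return c;
}

static void chunks_free(struct chunk *head)
{
    while (head) {
        struct chunk *next = head->next;
        free(head);
        head = next;
    }
}

/* Copies n entries from a flat array into a list of chunks.
 * Returns the head node, or NULL if an allocation fails. */
static struct chunk *chunks_from_array(const struct pollfd *src, size_t n)
{
    size_t todo = n;
    struct chunk *head = chunk_alloc(n < FIRST_CHUNK ? n : FIRST_CHUNK);
    struct chunk *walk = head;

    if (!head)
        return NULL;
    for (;;) {
        if (walk->len)
            memcpy(walk->entries, src + (n - todo),
                   walk->len * sizeof(*src));
        todo -= walk->len;
        if (!todo)
            break;
        walk->next = chunk_alloc(todo < CHUNK_MAX ? todo : CHUNK_MAX);
        if (!walk->next) {
            chunks_free(head);
            return NULL;
        }
        walk = walk->next;
    }
    return head;
}

/* Builds an array with fds 0..n-1, converts it, checks that the order
 * is preserved, and returns the number of list nodes (0 on any error). */
static size_t demo_chunk_count(size_t n)
{
    struct pollfd *arr = calloc(n ? n : 1, sizeof(*arr));
    struct chunk *head, *c;
    size_t nodes = 0, seen = 0, j;
    int ok = 1;

    if (!arr)
        return 0;
    for (j = 0; j < n; j++)
        arr[j].fd = (int)j;
    head = chunks_from_array(arr, n);
    for (c = head; c; c = c->next, nodes++)
        for (j = 0; j < c->len; j++, seen++)
            if (c->entries[j].fd != (int)seen)
                ok = 0;
    chunks_free(head);
    free(arr);
    return (ok && seen == n) ? nodes : 0;
}
```

With these toy sizes, 20 entries end up in three nodes (4 + 8 + 8). The benefit of this layout is that no single allocation has to grow with nfds; each node stays small regardless of how long the caller's array is.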
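So the book is right and the other material is right too; it was my own understanding that was off. I was standing on the first floor thinking I could see the fifth, when I had really only seen the third, haha. In user space poll's fd set really is an array, but in kernel space it is neatly turned into a linked list of array chunks. Allocating an array of pollfd structures and telling the kernel how many elements it has becomes the caller's responsibility, so the kernel no longer needs a fixed-size data type like fd_set. The count is not completely unbounded, though: do_sys_poll returns -EINVAL when nfds exceeds rlimit(RLIMIT_NOFILE).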
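The nfds > rlimit(RLIMIT_NOFILE) check in do_sys_poll can be observed from user space. The sketch below assumes Linux behavior as shown in the listing; poll_errno_with_limit is a hypothetical helper that temporarily lowers the soft limit, calls poll on an array of ignored entries, and reports the resulting errno.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Temporarily sets the soft RLIMIT_NOFILE to `limit`, then polls `nfds`
 * entries whose fd is -1 (poll ignores negative descriptors).
 * Returns 0 if poll succeeded, the errno value if poll failed,
 * or -1 if the setup itself failed. */
static int poll_errno_with_limit(rlim_t limit, nfds_t nfds)
{
    struct rlimit saved, lowered;
    struct pollfd *fds;
    nfds_t i;
    int err;

    if (getrlimit(RLIMIT_NOFILE, &saved) != 0 || limit > saved.rlim_max)
        return -1;

    fds = calloc(nfds ? nfds : 1, sizeof(*fds));
    if (!fds)
        return -1;
    for (i = 0; i < nfds; i++)
        fds[i].fd = -1;  /* negative fd: the entry is skipped by poll */

    lowered = saved;
    lowered.rlim_cur = limit;
    if (setrlimit(RLIMIT_NOFILE, &lowered) != 0) {
        free(fds);
        return -1;
    }

    err = poll(fds, nfds, 0) < 0 ? errno : 0;

    (void)setrlimit(RLIMIT_NOFILE, &saved);  /* restore the soft limit */
    free(fds);
    return err;
}
```

With the soft limit at 64, an array of 64 entries is accepted while 65 entries should be rejected with EINVAL, so the real ceiling on nfds is the process's open-file limit rather than a constant compiled into a type.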
The kernel designers are really impressive.

Last modification:May 14th, 2020 at 12:11 am