| "Only when the channel runs clear can you see the spring that feeds it." Learning the frontiers of a field, drawing inspiration from other research areas, and understanding the essence of a research problem more clearly are inexhaustible sources of information and self-improvement. To that end, we curate reading notes on selected papers to help you read the research literature broadly and deeply in our "Source of Living Water" column. Stay tuned.
Author: Mochen-Fan Hanchie
Link: https://www.zhihu.com/people/huang-han-chi-15
https://medium.com/@iclr_conf/ourhatata-the-reviewing-process-and-research-shaping-iclr-in-2020-ea9e53eb4c46 (source of the word cloud, which also covers non-reinforcement-learning papers)
01 Frequently recurring keywords: multi-agent, hierarchical RL / skill discovery, exploration, adversarial, meta-RL, meta-learning, transfer / generalization, evolution, graph / GNN / GCN, reasoning, intrinsic reward / curiosity, generative methods, imitation learning, robustness, sample efficiency / estimation, model-based, off-policy, curriculum learning, safe / constrained learning
02 Top 10 reinforcement learning papers from ICLR 2020
https://analyticsindiamag.com/top-10-reinforcement-learning-papers-from-iclr-2020/
1| Graph Convolutional Reinforcement Learning
2| Measuring the Reliability of Reinforcement Learning Algorithms
3| Behaviour Suite for Reinforcement Learning
4| The Ingredients of Real World Robotic Reinforcement Learning
5| Network Randomization: A Simple Technique for Generalization in Deep Reinforcement Learning
6| On the Weaknesses of Reinforcement Learning for Neural Machine Translation
7| Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation
8| Adversarial Policies: Attacking Deep Reinforcement Learning
9| Causal Discovery with Reinforcement Learning
10| Model Based Reinforcement Learning for Atari
03 Paper notes
1.《Posterior sampling for multi-agent reinforcement learning: solving extensive games with imperfect information》 Keywords: MARL, posterior sampling, game theory. Posterior sampling for reinforcement learning (PSRL) is a framework for decision making in unknown environments: PSRL maintains a posterior distribution over environments and plans in an environment sampled from that posterior. Although PSRL works well for single-agent RL, applying it to multi-agent RL has remained unexplored. In this work, the authors extend PSRL to two-player zero-sum extensive games with imperfect information (TEGI), a class of multi-agent systems. More specifically, they combine PSRL with Counterfactual Regret Minimization (CFR), the leading algorithm for TEGI with a known environment. The main contribution is a novel design of the interaction policies, which provides both theoretical and empirical guarantees for the algorithm.
2.《Dynamics-Aware Unsupervised Skill Discovery》 Keywords: unsupervised learning, model-based learning, hierarchical RL. HIGHLIGHT: an unsupervised skill-discovery method that enables model-based planning for hierarchical RL. Traditionally, model-based RL (MBRL) aims to learn a global model of the environment dynamics; a good model lets planning algorithms generate diverse behaviors and potentially solve diverse tasks. However, learning an accurate model of complex dynamics remains hard, and even a learned model may not generalize beyond the distribution of states it was trained on. In this work, the authors combine model-based learning with model-free learning of primitives to make model-based planning easy. The question they ask is: how can we find skills whose outcomes are easy to predict? To this end they propose Dynamics-Aware Discovery of Skills (DADS), an unsupervised learning algorithm that discovers predictable behaviors while learning their dynamics. In principle the method can exploit a continuous skill space and learn infinitely many behaviors, even in high-dimensional state spaces. Planning in the learned latent space significantly outperforms standard MBRL and model-free goal-conditioned RL, handles sparse-reward tasks, and improves over prior unsupervised skill-discovery methods. The authors show this yields significant gains for hierarchical RL. Code: https://github.com/google-research/dads
3.《Harnessing Structures for Value-Based Planning and Reinforcement Learning》 Keywords: value-based RL. HIGHLIGHT: a general framework that exploits low-rank structure in planning and deep RL. The paper proposes to exploit the structure of the state-action value function (the Q function) for planning and DRL: if the dynamics of the underlying system induce some global structure in the Q function, we should be able to infer the function better by leveraging that structure. Specifically, the authors study the low-rank structure that is common in large data matrices and empirically verify the existence of low-rank Q functions in control and DRL tasks. Using matrix estimation (ME) techniques, they propose a general framework to exploit the underlying low-rank structure of the Q function. This yields a more efficient planning procedure for classical control and also allows a simple scheme to be applied on top of value-based RL, consistently obtaining better performance on "low-rank" tasks. Extensive experiments on control tasks and Atari games confirm the effectiveness of the approach. Code: https://github.com/YyzHarry/SV-RL
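To make the low-rank idea concrete, here is a minimal numpy sketch, assuming a synthetic Q-matrix and a naive truncated-SVD reconstruction as a stand-in for the paper's matrix-estimation routine; all names, shapes, and the 30% observation rate are illustrative only.

```python
# If the |S| x |A| Q-matrix is approximately low-rank, a subset of observed entries plus
# a rank-r reconstruction can recover the rest. This is a simplified stand-in, not the
# paper's ME algorithm.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, rank = 200, 20, 3

# Synthetic low-rank "ground-truth" Q-matrix.
Q_true = rng.normal(size=(n_states, rank)) @ rng.normal(size=(rank, n_actions))

# Observe only 30% of the entries (e.g., the state-action pairs actually evaluated).
mask = rng.random(Q_true.shape) < 0.3
Q_obs = np.where(mask, Q_true, 0.0)

# Naive reconstruction: rescale for the missing mass, then truncate to rank r via SVD.
U, s, Vt = np.linalg.svd(Q_obs / mask.mean(), full_matrices=False)
Q_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]

err = np.linalg.norm((Q_hat - Q_true)[~mask]) / np.linalg.norm(Q_true[~mask])
print(f"relative error on unobserved entries: {err:.3f}")
```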
4.《Causal Discovery with Reinforcement Learning》 Keywords: causal discovery, structure learning, reinforcement learning, directed acyclic graph. HIGHLIGHT: applying RL to score-based causal discovery, with promising results on both synthetic and real datasets. In this paper, the causality research team at Huawei Noah's Ark Lab applies RL to score-based causal discovery: a self-attention-based encoder-decoder neural network maps the observed data to candidate causal structures, and the network parameters are trained with a policy-gradient RL algorithm, finally yielding the causal graph. On data models commonly used in the literature, the method outperforms other approaches on graphs of moderate size, including traditional causal discovery algorithms and more recent gradient-based methods. The method is also very flexible and can be combined with any score function.
5.《SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference》 Keywords: machine learning, scalability, distributed, DeepMind Lab, ALE, Atari-57, Google Research Football. The authors present a state-of-the-art scalable RL agent called SEED (Scalable, Efficient Deep-RL). By effectively utilizing modern accelerators, it not only trains at millions of frames per second but also lowers the cost of experiments. They achieve this with a simple architecture featuring centralized inference and an optimized communication layer. SEED adopts two state-of-the-art distributed algorithms, IMPALA/V-trace (policy gradient) and R2D2 (Q-learning), and is evaluated on Atari-57, DeepMind Lab and Google Research Football, reaching higher performance at lower cost. Code: https://drive.google.com/file/d/144yp7PQf486dmctE2oS2md_qmNBTFbez/view
6.《Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning》 Keywords: function approximation, lower bounds, representation. HIGHLIGHT: exponential lower bounds for value-based and policy-based reinforcement learning with function approximation.
7.《Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning》 Keywords: multi-agent RL, theory of mind. HIGHLIGHT: develops the Simplified Action Decoder (SAD), a simple MARL algorithm that clearly beats the previous SOTA on 2-5 player Hanabi. Learning to be informative while being observed by others is an interesting challenge in RL: RL fundamentally requires agents to explore in order to discover good policies, but the randomness of naive exploration makes an agent's training-time behavior inherently less informative to other agents. The authors propose a new deep MARL method, the Simplified Action Decoder (SAD), which resolves this tension by augmenting the training phase: during training, other agents can observe not only the exploratory action a teammate actually takes, but also that teammate's greedy action. The paper combines this simple intuition with best practices for multi-agent learning such as auxiliary tasks and state prediction. Code: https://bit.ly/2mBJLyk
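The core trick can be illustrated with a toy sketch; `sad_step` and the observation dictionary below are assumed names, not the paper's implementation, which builds this into a recurrent Q-learning agent.

```python
# During training each agent executes an epsilon-greedy action in the environment,
# but also exposes its greedy action so teammates receive an informative signal.
import numpy as np

rng = np.random.default_rng(0)

def sad_step(q_values: np.ndarray, epsilon: float):
    """Return (executed_action, greedy_action) for one agent."""
    greedy = int(np.argmax(q_values))
    if rng.random() < epsilon:
        executed = int(rng.integers(len(q_values)))  # exploratory move
    else:
        executed = greedy
    return executed, greedy

# The environment sees `executed`, while teammates' observations are augmented
# with `greedy`, which carries the agent's actual intent.
q = rng.normal(size=5)
executed, greedy = sad_step(q, epsilon=0.3)
teammate_obs = {"last_action": executed, "greedy_action": greedy}
print(teammate_obs)
```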
8.《Behaviour Suite for Reinforcement Learning》 Keywords: benchmark, core issues, scalability, reproducibility. HIGHLIGHT: Bsuite is a collection of carefully designed experiments for investigating the core capabilities of RL agents. Code: https://github.com/deepmind/bsuite
9.《Model Based Reinforcement Learning for Atari》 Keywords: model-based RL, video prediction model, Atari. HIGHLIGHT: video prediction models plus a model-based RL algorithm that trains agents on 26 Atari games using only 2 hours of gameplay per game. The paper explores how video prediction models can enable agents to solve Atari games with far fewer interactions than model-free methods. The authors experiment with several probabilistic video prediction techniques, including a new model based on discrete latent variables, and use these learned simulators to train the policies that act in the games. The resulting method is called Simulated Policy Learning (SimPLe). Code: http://bit.ly/2wjgn1a
10.《Measuring the Reliability of Reinforcement Learning Algorithms》 Keywords: metrics, statistics, reliability. HIGHLIGHT: a set of new metrics (with accompanying statistical tests) for measuring the reliability of RL algorithms, both during training and after learning (on a fixed policy), focusing on variability and risk in both regimes. Code: https://github.com/google-research/rl-reliability-metrics
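The sketch below illustrates the flavor of such metrics on synthetic training curves; it is loosely modeled on the kinds of quantities the paper proposes (dispersion across runs, short-term risk), not the library's exact definitions or API.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic training curves: 10 independent runs, 200 evaluation points each.
curves = np.cumsum(rng.normal(0.5, 1.0, size=(10, 200)), axis=1)

# Dispersion across runs: interquartile range of final performance.
final = curves[:, -1]
iqr_across_runs = np.subtract(*np.percentile(final, [75, 25]))

# Risk within a run: conditional value-at-risk (mean of the worst 5%)
# of step-to-step performance changes, averaged over runs.
diffs = np.diff(curves, axis=1)
alpha = 0.05
cvar = np.mean([d[d <= np.quantile(d, alpha)].mean() for d in diffs])

print(f"IQR across runs: {iqr_across_runs:.2f}, short-term risk (CVaR): {cvar:.2f}")
```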
11.《The Ingredients of Real World Robotic Reinforcement Learning》 Keywords: robotics. HIGHLIGHT: a system for learning real-world robotic tasks with RL, without instrumentation. The paper discusses the practical issues of deploying RL on real physical robots and proposes solutions, including learning from raw sensory observations, specifying reward functions, and learning without resets at the end of episodes.
12.《Maximum Likelihood Constraint Inference for Inverse Reinforcement Learning》 Keywords: learning from demonstration, inverse RL, constraint inference. HIGHLIGHT: uses the maximum-entropy principle to quantify the difference between demonstrations and the expected unconstrained behavior and to infer the constraints under which the task was executed. The authors restate the IRL problem in terms of a Markov decision process (MDP): given a nominal model of the environment and a nominal reward function, they estimate the state, action, and feature constraints that would explain the agent's good behavior. The approach builds on the maximum-entropy IRL framework, which makes it possible to reason about the likelihood of the expert's demonstrations given the knowledge of the MDP. The new algorithm estimates which constraints can be added to the MDP to maximize the likelihood of the observed demonstrations, iteratively inferring the maximum-likelihood constraints that best explain the observed behavior; its effectiveness is tested on simulated behavior and on recorded data of humans walking around obstacles. Code: https://drive.google.com/drive/folders/1pJ7o4w4J0_dpldTRpFu_jWQR8CkBbXw
13.《Improving Generalization in Meta Reinforcement Learning using Neural Objectives》 Keywords: meta-RL, meta-learning. HIGHLIGHT: a new meta reinforcement learning algorithm, MetaGenRL. Unlike prior work, MetaGenRL generalizes to entirely new environments unseen during meta-training. Biological evolution has distilled the experience of many learners into the general-purpose learning algorithms of humans; MetaGenRL is inspired by this process. It distills the experience of many complex agents to meta-learn a low-complexity neural objective function that determines how future individuals will learn. Unlike recent meta-RL algorithms, MetaGenRL generalizes to new environments that are entirely different from those used for meta-training, and in some cases it even outperforms hand-designed RL algorithms. MetaGenRL uses off-policy second-order gradients during meta-training, which greatly improves sample efficiency.
14.《Making Sense of Reinforcement Learning and Probabilistic Inference》 Keywords: probabilistic inference, uncertainty, exploration. HIGHLIGHT: popular algorithms in the "RL as inference" framework ignore the role of uncertainty and exploration; the paper highlights the importance of these issues and presents a coherent framework for RL and inference that handles them correctly. RL combines control and statistical inference: the agent does not know the system dynamics but can learn from experience. Recent work casts RL in an "RL as inference" framework that generalizes the RL problem to probabilistic inference. The paper reveals a key shortcoming of this approach and explains what it would mean to make RL truly consistent with inference. In particular, an RL agent must weigh the trade-off between exploration and exploitation. In all but the simplest settings, exact inference is computationally intractable, so practical RL algorithms must rely on approximations. The authors show that even on very basic problems the popular "RL as inference" approximation can perform poorly. However, with only minor modifications the framework yields algorithms with demonstrably better performance, and the resulting algorithm is equivalent to the recently proposed K-learning, which is related to Thompson sampling.
15.《Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation》 Keywords: deep learning, graph neural networks, natural language processing, question generation. HIGHLIGHT: natural question generation (QG) aims to generate a question from a passage and an answer. Previous work on QG (i) ignores the rich structural information hidden in the text, (ii) relies solely on the cross-entropy loss, which leads to exposure bias and a mismatch between training and test metrics, and (iii) fails to fully exploit the answer information. To address these limitations, the authors propose a new RL-based Graph2Seq model for QG, in which an effective deep alignment network exploits the answer information. They also propose a novel bidirectional GNN to handle the directed passage graph. Their two-stage training strategy benefits from both cross-entropy-based and RL-based sequence training. They further consider building both static and dynamic graphs from the text and systematically investigate the performance difference between the two. Code: https://github.com/hugochan/RL-based-GrappSeq-for-NQG
16.《On the Weaknesses of Reinforcement Learning for Neural Machine Translation》 Keywords: MRT, minimum risk training, reinforcement, machine translation, peakiness, generation. HIGHLIGHT: the gains RL practice brings to machine translation may not come from better predictions. Abstract: RL is frequently used to improve performance on text generation tasks, including machine translation (MT), notably through minimum risk training (MRT) and generative adversarial networks (GANs). However, little is known about what and how these methods learn in the MT setting. The authors show that one of the most common RL techniques for MT does not optimize the expected reward, while others are prohibitively slow. In fact, their experiments suggest that RL training for MT is likely to improve performance only when the pre-trained parameters are already close to producing the correct translation. Their findings further suggest that the observed gains may be due to effects unrelated to the training signal, such as changes in the shape of the output distribution (its peakiness).
17.《SQIL: Imitation Learning via Reinforcement Learning with Sparse Rewards》 Keywords: imitation learning. HIGHLIGHT: a simple and effective alternative to adversarial imitation learning: initialize the experience replay buffer with demonstrations at reward +1, set the reward for all other data to 0, and run Q-learning or soft actor-critic training. Learning to imitate expert behavior from demonstrations can be hard, especially in environments with high-dimensional, continuous observations and unknown dynamics. Supervised approaches based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates the demonstrated actions, errors accumulate and it can drift away from the demonstrated states. Recent RL-based methods such as inverse RL and generative adversarial imitation learning (GAIL) overcome this by training an RL agent to match the demonstrations over a long horizon. Since the task's true reward function is unknown, these methods typically learn a reward function from the demonstrations using complex, brittle approximation techniques and adversarial training. The authors propose a simple alternative that still uses RL but does not require learning a reward function. The key idea is to encourage the agent to return to demonstrated states when it encounters new, out-of-distribution states, thereby matching the demonstrations over the long run. This is achieved by giving the agent a constant reward of r=+1 for matching the demonstrated action in a demonstrated state, and a constant reward of r=0 for all other behavior. The resulting algorithm, Soft Q Imitation Learning (SQIL), can be implemented with only minor modifications to standard Q-learning or soft actor-critic. Theoretically, SQIL can be interpreted as a regularized variant of BC with a sparsity prior that encourages long-horizon imitation. On image-based and low-dimensional tasks in Box2D, Atari, and MuJoCo, SQIL outperforms BC and achieves competitive results compared to GAIL. The paper mainly demonstrates that a simple RL-based imitation technique with constant rewards can be as effective as more complex techniques that use learned rewards.
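Because the reward relabeling is the whole trick, here is a minimal sketch of it; the buffer layout and class name are assumed for illustration, and in practice this feeds a standard soft Q-learning or SAC update rather than the training loop shown here.

```python
import random
from collections import deque

class SQILBuffer:
    """Replay buffer that stores demonstrations at reward +1 and new experience at 0."""

    def __init__(self, demos, max_size=100_000):
        # demos: iterable of (obs, action, next_obs, done) tuples from the expert
        self.demo = [(o, a, 1.0, o2, d) for (o, a, o2, d) in demos]   # r = +1
        self.agent = deque(maxlen=max_size)

    def add_agent_step(self, obs, action, next_obs, done):
        self.agent.append((obs, action, 0.0, next_obs, done))          # r = 0

    def sample(self, batch_size):
        # SQIL samples demonstration and agent transitions in equal proportion.
        half = batch_size // 2
        batch = random.sample(self.demo, min(half, len(self.demo)))
        if self.agent:
            batch += random.sample(self.agent, min(half, len(self.agent)))
        return batch

# Sampled batches are then fed to an off-the-shelf Q-learning / SAC update.
```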
18.《AutoQ: Automated Kernel-Wise Neural Network Quantization》 Keywords: AutoML, kernel-wise neural network quantization, hierarchical deep RL. HIGHLIGHT: accurate, fast, and automated kernel-wise neural network quantization with mixed precision via hierarchical deep RL. The paper proposes AutoQ, a hierarchical-DRL-based kernel-wise network quantization technique that automatically searches a quantization bit-width (QBN) for each weight kernel and selects a different QBN for each activation layer. Compared with models quantized by state-of-the-art DRL-based methods, the same models quantized by AutoQ reduce inference latency by 54.06% and inference energy by 50.69% on average while achieving the same inference accuracy.
19.《SVQN: Sequential Variational Soft Q-Learning Networks》 Keywords: POMDP, variational inference, generative model. HIGHLIGHT: SVQN formalizes hidden-state inference and maximum-entropy RL under a unified graphical model and jointly optimizes the two modules. Partially observable Markov decision processes (POMDPs) are a flexible model popular in real-world decision-making applications, which require information from past observations to make optimal decisions. Standard RL algorithms designed for MDP tasks are not suitable because they cannot infer the unobserved states. The paper formalizes hidden-state inference and maximum-entropy RL (MERL) under a unified graphical model and develops a new POMDP algorithm that jointly optimizes the two modules, Sequential Variational Soft Q-learning Networks (SVQN); deep recurrent neural networks are further designed to reduce the computational complexity. Experiments show that SVQN exploits past information to support decisions and performs effective inference, outperforming other baselines on several difficult tasks. Ablation studies show that SVQN generalizes over time and is robust to observation perturbations.
20.《Observational Overfitting in Reinforcement Learning》 Keywords: observation, overfitting, generalization, implicit regularization, overparameterization. HIGHLIGHT: a way to analyze overfitting to irrelevant parts of the state space in RL, together with a framework for measuring the generalization gap. A major symptom of overfitting in model-free RL is that the agent mistakenly correlates reward with spurious features of the observations generated by a Markov decision process (MDP). The authors provide a general framework for analyzing this phenomenon, which they use to design multiple synthetic benchmarks by simply modifying the observation space of an MDP. When an agent overfits to different observation spaces even though the underlying MDP dynamics are fixed, they call this observational overfitting. Their experiments expose intriguing properties, especially regarding implicit regularization, and corroborate earlier results on RL generalization and supervised learning (SL).
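A minimal sketch of the kind of synthetic benchmark described above follows; the wrapper interface, feature count, and the sinusoidal noise signal are assumptions for illustration, not the paper's benchmark code.

```python
# Keep the underlying dynamics fixed, but append irrelevant features drawn from a
# level-specific random projection to each observation.
import numpy as np

class NoiseFeatureWrapper:
    """Wraps an environment whose reset()/step() return observation vectors."""

    def __init__(self, env, n_noise_features=32, level_seed=0):
        self.env = env
        rng = np.random.default_rng(level_seed)
        # The spurious features differ across "levels" (seeds) but carry no reward signal.
        self.projection = rng.normal(size=(n_noise_features,))
        self.t = 0

    def _augment(self, obs):
        noise = np.sin(self.projection * self.t)  # time-varying, reward-irrelevant features
        return np.concatenate([np.asarray(obs, dtype=float), noise])

    def reset(self):
        self.t = 0
        return self._augment(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        return self._augment(obs), reward, done, info

# Training on a few level seeds and evaluating on held-out seeds measures how much
# the policy latched onto the irrelevant features.
```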
21.《Multi-agent Reinforcement Learning for Networked System Control》 Keywords: multi-agent reinforcement learning, decision and control. HIGHLIGHT: a new formulation and a new communication protocol for networked multi-agent control problems. The paper considers multi-agent RL (MARL) for networked system control, where each agent learns a decentralized control policy based on local observations and messages from neighboring agents. The authors formulate this networked MARL (NMARL) problem as a spatiotemporal MDP and introduce a spatial discount factor to stabilize the training of each local agent. They further propose a new differentiable communication protocol, NeurComm, to reduce information loss and non-stationarity in NMARL. Experiments on realistic NMARL scenarios of adaptive traffic signal control and cooperative adaptive cruise control show that an appropriate spatial discount factor effectively improves the learning curves of non-communicating MARL algorithms. Code: https://github.com/cts198859/deeprl_network

22.《Learning the Arrow of Time for Problems in Reinforcement Learning》 Keywords: Arrow of Time, AI safety. HIGHLIGHT: we study the arrow of time in MDPs and use it to measure reachability, detect side effects, and obtain a curiosity reward signal. Humans have an innate understanding of the asymmetric progression of time, which we use to perceive and manipulate our environment efficiently and safely. Inspired by this, the authors address the problem of learning an arrow of time in Markov (decision) processes. They show how a learned arrow of time captures salient information about the environment, which in turn can be used to measure reachability, detect side effects, and obtain an intrinsic reward signal. Finally, they propose a simple yet effective algorithm to parameterize the problem and learn the arrow of time with a function approximator (here, a deep neural network). Empirical results cover a selection of discrete and continuous environments. Code: https://www.sendspace.com/file/0mx0en

23.《Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives》 Keywords: variational information bottleneck, learning primitives. HIGHLIGHT: learn an implicit master policy, since an explicit master policy in HRL may not generalize. In contrast to many current hierarchical RL methods, the authors propose a decentralized approach in which low-level policies decide for themselves whether to act in the current state, rather than having a centralized higher-level meta-policy choose among them. Hierarchical RL decomposes a policy into lower-level primitives or options plus a higher-level meta-policy that triggers the appropriate behavior for a given situation; but the meta-policy must still make appropriate decisions in all states. In this work, the policy is decomposed into primitives with no high-level meta-policy: each primitive decides for itself whether it wishes to act in the current state. An information-theoretic mechanism realizes this decentralized decision making: each primitive chooses how much information about the current state it needs to make its decision. Left alone, a primitive would request as much information as possible, so the primitives are regularized to use as little information as possible, which leads to natural competition and specialization. Experiments show that the new policy architecture improves generalization over both flat and hierarchical policies.

24.《Exploration in Reinforcement Learning with Deep Covering Options》 (poster) Keywords: temporal abstraction, exploration. HIGHLIGHT: a method that automatically discovers task-agnostic options that encourage exploration in RL. Current methods for accelerating exploration in RL are often heuristic. Recently, researchers proposed covering options, which discover a set of options that provably reduce an upper bound on the environment's cover time, a measure of exploration difficulty. Covering options are computed using the eigenvectors of the graph Laplacian, but they are limited to tabular tasks and do not apply to tasks with large or continuous state spaces. The authors introduce deep covering options, an online method that extends covering options to large state spaces, automatically discovering task-agnostic options that encourage exploration.

25.《Logic and the 2-Simplicial Transformer》 (poster) Keywords: transformer, logic, reasoning. HIGHLIGHT: introduces the 2-simplicial Transformer, an extension of the Transformer that includes a form of higher-dimensional attention generalizing dot-product attention, and uses this attention to update entity representations with tensor products of value vectors. The authors show that this architecture is a useful inductive bias for logical reasoning in the context of deep RL. Review: the paper extends the Transformer with a higher-order attention mechanism that generalizes dot-product attention. Reviewer 3 considered the extension of attention from second-order to third-order relations an important improvement, found the mathematical context insightful, and thought it may lead to further developments. Code: https://github.com/dmurfet/2simplicialtransformer
26.《Watch, Try, Learn: Meta-Learning from Demonstrations and Rewards》 Keywords: meta-learning, imitation learning. HIGHLIGHT: a meta-learning method that learns from demonstrations followed by trial-and-error RL. Imitation learning allows agents to learn complex behaviors from demonstrations, but learning complex vision-based tasks may require an impractical number of them. Meta-imitation learning is a promising approach that lets an agent learn a new task from one or a few demonstrations by leveraging experience from learning similar tasks. However, when the task is ambiguous or the dynamics are unobservable, demonstrations alone may not provide enough information; the agent must also try the task to successfully infer the policy. In this work, the authors propose a method that can learn from demonstrations and from trial-and-error experience with sparse reward feedback. Compared to meta-imitation, this enables the agent to effectively and autonomously improve itself beyond the demonstration data. Compared to meta-RL, it can scale to a much broader distribution of tasks because the demonstrations reduce the burden of exploration. Experiments show that the method significantly outperforms prior approaches on a suite of challenging vision-based control tasks. Code: https://drive.google.com/open?id=1f1LzO0fe1m-kINY8DTgL6JGimVGiQOuz

27.《Adversarial Policies: Attacking Deep Reinforcement Learning》 Keywords: adversarial examples, security, multi-agent. HIGHLIGHT: DRL policies can be attacked by other agents that act so as to create natural observations that are adversarial. It is well known that DRL policies are vulnerable to adversarial perturbations of their observations, similar to adversarial examples for classifiers. However, an attacker usually cannot directly modify another agent's observations. This raises the question: is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment, so as to create natural observations that are adversarial? The authors demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. They are more successful in high-dimensional environments and induce activations in the victim's policy network that are substantially different from those induced by a normal opponent. Videos: adversarialpolicies.github.io. Code: https://github.com/humancompatibleai/adversarial-policies

28.《Population-Guided Parallel Policy Search for Reinforcement Learning》 Keywords: parallel learning, population based learning. HIGHLIGHT: a new multi-actor RL method that distills the policy of the best-performing agent in a soft manner while maintaining a distance between agents to ensure the diversity and performance of the population; the authors show improvements over several state-of-the-art single-actor algorithms and several other multi-actor RL algorithms. The paper proposes a new population-guided parallel learning scheme to improve the performance of off-policy RL. In the new scheme, multiple identical learners with their own value functions and policies share a single experience replay buffer and cooperatively search for a good policy under the guidance of the best policy's information. The key is to construct an augmented loss function for policy updates that enlarges the overall search range of the multiple learners while fusing the best policy's information in a soft manner. With guidance from the previous best policy and the enlarged search range, policy search becomes faster and better, and the authors theoretically prove a monotonic improvement in the expected cumulative return of the proposed scheme.

29.《Learning Efficient Parameter Server Synchronization Policies for Distributed SGD》 Keywords: distributed SGD, parameter server, synchronization policy. HIGHLIGHT: an RL-based approach to learn the optimal synchronization policy for parameter-server-based distributed training of SGD. By formalizing synchronization policies in the PS setting, the authors derive a suitable and compact description of states and actions, which lets them use a standard off-the-shelf DQN algorithm. As a result, they are able to learn synchronization policies that adapt to different cluster environments, different training datasets, and small model variations, and (most importantly) greatly reduce training time compared to standard policies such as bulk synchronous parallel (BSP), asynchronous parallel (ASP), or stale synchronous parallel (SSP), while the learned policies generalize to a variety of unseen cases.
30.《Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents》 Keywords: visualization, safety. HIGHLIGHT: we generate critical states for trained RL algorithms to visualize potential weaknesses. As vision-based deep RL becomes more widespread, there is a growing need to better understand and probe the learned agents. Understanding the decision process and its relationship to the visual input is valuable for identifying problems in learned behavior, yet this topic remains relatively under-explored in the research community. The authors propose a method for synthesizing visual inputs of interest for a trained agent. Such inputs or states can be situations in which a specific action is necessary; moreover, critical states that yield very high or very low reward are often informative about the agent's perception of the situation, since they can correspond to dangerous states. To this end, a generative model is learned over the environment's state space and its latent space is used to optimize a target function for the states of interest. Experiments show that this method provides insights for a variety of environments and RL methods; results on standard Atari benchmark games and an autonomous driving simulator indicate that the approach speeds up the identification of behavioral weaknesses. The authors believe this general method can serve as an important tool for AI safety.

31.《Option Discovery using Deep Skill Chaining》 Keywords: hierarchical reinforcement learning, skill discovery, deep learning. HIGHLIGHT: a new hierarchical RL algorithm that solves high-dimensional goal-oriented tasks more reliably than non-hierarchical agents and other state-of-the-art skill-discovery techniques. Autonomously discovering temporally extended actions, or skills, is a long-standing goal of hierarchical RL. The authors propose a new algorithm that combines skill chaining with deep neural networks to autonomously discover skills in high-dimensional continuous domains. The resulting algorithm, deep skill chaining, constructs skills such that executing one skill enables the agent to execute another. On challenging continuous-control tasks, deep skill chaining significantly outperforms non-hierarchical agents and other state-of-the-art skill-discovery techniques. Code: https://github.com/deep-skill-chaining/deep-skill-chaining

32.《Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery》 Keywords: semi-supervised learning, unsupervised learning, robotics, deep learning. HIGHLIGHT: we show how dynamical distances can be learned automatically in the RL setting and used to provide well-shaped reward functions for reaching new goals. RL typically requires a manually specified reward function to learn a task. Although in principle this reward function only needs to specify the task goal, in practice RL can be very time-consuming or even infeasible unless the reward is shaped so as to provide a smooth gradient toward a successful outcome. Such shaping is difficult to specify by hand, particularly when the task is learned from raw observations such as images. In this paper the authors study how to automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. The authors show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances while a small amount of preference supervision determines the task goal, without any manually engineered reward function or goal examples. The method is evaluated on a real-world robot and in simulation; it learns to turn a valve with a raw 9-DoF hand, using raw image observations and ten preference labels, without any other supervision. Videos of the learned skills: https://sites.google.com/view/dynamical-distance-learning
33.《Reinforced active learning for image segmentation》 Keywords: semantic segmentation, active learning. HIGHLIGHT: learning a labeling policy with RL to reduce the labeling effort for semantic segmentation. Learning-based approaches to semantic segmentation face two inherent challenges. First, acquiring pixel-level (element-wise) labels is expensive and time-consuming. Second, realistic segmentation datasets are highly imbalanced: some categories are far more abundant than others, biasing performance toward the most represented classes. The paper focuses human labeling effort on a small subset of a larger data pool, minimizing that effort while maximizing the performance of the segmentation model on a hold-out set. The authors propose a new DRL-based active learning strategy for semantic segmentation: an agent learns a policy to select, from an unlabeled data pool, a small subset of informative image regions (as opposed to entire images) to be labeled. Region selection decisions are based on the predictions and uncertainty of the segmentation model being trained. The new method proposes a modification of the DQN formulation for active learning, adapting it to the large-scale nature of semantic segmentation. A proof of concept is tested on CamVid, with results provided on the large-scale Cityscapes dataset. On Cityscapes, the region-based DQN approach requires roughly 30% less additional labeled data than the most competitive baseline to reach the same performance. Moreover, compared to the baselines, the method asks for more labels of under-represented categories, improving their performance and helping to mitigate class imbalance.

34.《CAQL: Continuous Action Q-Learning》 Keywords: DQN, continuous control, mixed-integer programming (MIP). HIGHLIGHT: a general framework for value-based RL for continuous control. Value-based RL methods such as Q-learning have been successful in domains such as games and recommender systems. When the action space is finite, these algorithms work well by implicitly finding a policy through learning the optimal value function. However, a major challenge in extending Q-learning to continuous-action RL problems is that computing the optimal Bellman backup requires solving a continuous action maximization (max-Q) problem. Although the Q-function is commonly restricted to be concave in the action to simplify the max-Q problem, such restrictions can degrade performance; and when the Q-function is parameterized by a generic feed-forward neural network (NN), the max-Q problem can be NP-hard. In this work the authors propose CAQL, which minimizes the Bellman residual using Q-learning together with one of several plug-and-play action optimizers. In particular, leveraging recent advances in optimization over DNNs, they show that the max-Q problem can be solved optimally with mixed-integer programming (MIP); when the Q-function has sufficient representational power, this MIP-based optimization induces better policies and is more robust than approximate max-Q solvers such as CEM or GA. Three techniques are developed to speed up training: (i) dynamic tolerance, (ii) dual filtering, and (iii) clustering. To speed up inference, an action function is learned jointly with the optimal policy. To demonstrate the effectiveness of CAQL, the authors compare it with state-of-the-art RL algorithms on benchmark continuous-control problems with varying degrees of action constraints, and show that CAQL significantly outperforms policy-based methods in heavily constrained environments.

35.《Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning》 (Poster) Keywords: logic, QBF, logical reasoning, SAT, graph, GNN. HIGHLIGHT: we use RL to automatically learn branching heuristics for industrial problems in a state-of-the-art QBF solver. The authors demonstrate how to learn effective heuristics for automated reasoning algorithms over quantified Boolean formulas through deep RL. They focus on a backtracking search algorithm that can already solve formulas of impressive size, with up to hundreds of thousands of variables. The main challenge is to find a representation of these formulas that lends itself to making predictions in a scalable way. For a family of challenging problems, they learn a heuristic that solves significantly more formulas than the existing handwritten heuristics. (PS: this paper is not particularly novel on the RL side; its main contribution is casting automated reasoning over quantified Boolean formulas as an MDP.)

36.《AMRL: Aggregated Memory For Reinforcement Learning》 (Poster) Keywords: deep learning, RL, memory, noise, machine learning. HIGHLIGHT: in DRL, order-invariant functions can be combined with standard memory modules to improve gradient decay and robustness to noise. In many partially observable scenarios, an RL agent must rely on long-term memory to learn the optimal policy. The authors show that techniques borrowed from NLP and supervised learning fail on RL tasks due to the stochasticity of the environment and of exploration. Using their insights into the limitations of traditional memory methods in RL, they propose AMRL, a class of models that can learn better policies with higher sample efficiency and are resilient to noisy inputs. Specifically, the models use a standard memory module to summarize short-term context, and then aggregate all prior states from that module in an order-invariant way. This is shown to confer advantages both in gradient decay and in the signal-to-noise ratio over time. Evaluations are performed in Minecraft and maze environments that test long-term memory.

37.《CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning》 Keywords: multi-agent reinforcement learning. HIGHLIGHT: a modular method for fully cooperative multi-goal multi-agent RL, based on curriculum learning, that explores efficiently and allocates credit for the interaction between actions and goals. A variety of cooperative multi-agent control problems require agents to contribute to collective success while achieving individual goals. This multi-goal multi-agent setting poses difficulties for recent algorithms, which primarily target settings with a single global reward, through two new challenges: efficient exploration for learning both individual goal attainment and cooperation for others' success, and credit assignment for the interactions between the actions and goals of different agents. To address both challenges, the authors restructure the problem into a novel two-stage curriculum in which single-agent goal attainment is learned before multi-agent cooperation, and derive a new multi-goal multi-agent policy gradient with a credit function for localized credit assignment; a function-augmentation scheme bridges the value and policy functions across the curriculum. The complete architecture, called CM3, learns significantly faster than direct adaptations of existing algorithms on three challenging multi-goal multi-agent problems: cooperative navigation in difficult formations, multi-lane-change coordination in the SUMO traffic simulator, and strategic cooperation in a checkers environment.
38.《Toward Amortized Ranking-Critical Training For Collaborative Filtering》 Keywords: collaborative filtering, recommender systems, actor-critic, learned metrics. HIGHLIGHT: an actor-critic RL approach for training collaborative filtering models that more directly maximizes ranking-based objectives and improves performance across a variety of latent-variable models. Specifically, a critic network is trained to approximate ranking-based metrics, and the actor network is then updated to directly optimize against the learned metric. In contrast to traditional learning-to-rank methods, which must re-run the optimization procedure for new lists, the critic-based method amortizes the scoring process with a neural network and can directly provide the (approximate) ranking score for new lists. The actor-critic approach significantly improves the performance of a variety of prediction models, achieving better or comparable performance to a range of strong baselines on three large-scale datasets. Code: https://github.com/samlobel/RaCT_CF

39.《Chameleon: Adaptive Code Optimization For Expedited Deep Neural Network Compilation》 Keywords: learning to optimize, compilers, code optimization, neural networks, ML for systems, learning for systems. HIGHLIGHT: reinforcement learning and adaptive sampling for optimized compilation of deep neural networks. Achieving faster execution with shorter compilation time can foster further diversity and innovation in neural networks. However, the current paradigm for executing neural networks relies on hand-optimized libraries, traditional compilation heuristics, or, more recently, genetic algorithms and other stochastic methods. These methods require frequent and costly hardware measurements, which makes them not only time-consuming but also suboptimal. To this end, the authors devise a solution that can learn to quickly adapt to previously unseen design spaces for code optimization, both accelerating the search and improving the output performance. This solution, dubbed Chameleon, uses RL and converges in fewer search steps. Chameleon also develops an adaptive sampling algorithm that not only focuses costly samples (real hardware measurements) on representative points, but also uses domain-knowledge-inspired heuristic logic to improve the samples themselves. Experiments on real hardware show that Chameleon provides a 4.45x speedup over AutoTVM in optimization time while also improving the inference time of modern deep networks by 5.6%.

40.《Graph Constrained Reinforcement Learning for Natural Language Action Spaces》 (Poster) Keywords: natural language generation, knowledge graphs, interactive fiction. HIGHLIGHT: KG-A2C, an RL agent that builds a dynamic knowledge graph while exploring and generates natural language using a template-based action space, outperforming all current agents on a wide set of text-based games. Interactive fiction games are text-based simulations in which an agent interacts with the world entirely through natural language. They are an ideal environment for studying how to extend RL agents to meet the challenges of natural language understanding, partial observability, and action generation in combinatorially large text-based action spaces. The authors introduce KG-A2C, an agent that builds a dynamic knowledge graph while exploring and generates actions using a template-based action space. They argue that the dual uses of the knowledge graph, to reason about game state and to constrain natural language generation, are the keys to scalable exploration over combinatorial natural language actions. Results across a wide variety of IF games show that KG-A2C outperforms current IF agents despite the exponential growth of the action-space size. Code: https://github.com/rajammanabrolu/KG-A2C

41.《Composing Task-Agnostic Policies with Deep Reinforcement Learning》 Keywords: composition, transfer learning. HIGHLIGHT: a novel RL-based skill transfer and composition method that takes an agent's primitive policies to solve unseen tasks. Composing basic behavioral building blocks to solve challenging transfer learning problems is one of the key elements of building intelligent machines. To date there has been plenty of work on learning task-specific policies or skills, but almost no focus on composing the necessary task-agnostic skills to find solutions to new problems. In this paper the authors propose a novel deep-RL-based skill transfer and composition method that uses the agent's primitive policies to solve unseen tasks. They evaluate it in difficult settings where training policies through standard RL, or even hierarchical RL, is either infeasible or exhibits high sample complexity. The new method not only transfers skills to new problem settings but also solves challenging environments requiring both task planning and motion control, with high data efficiency. Code: https://drive.google.com/file/d/1pbF9vMy5E3NLdOE5Id5zqzKlUesgStym/view?usp=sharing
42.《Single episode transfer for differing environmental dynamics in reinforcement learning》 Keywords: transfer learning. HIGHLIGHT: single-episode policy transfer in a family of environments with related dynamics, by optimizing a probe to rapidly infer latent variables and immediately execute a universal policy. Transfer and adaptation to new, unknown environmental dynamics is a key challenge for RL. An even greater challenge is performing near-optimally in a single attempt at test time, possibly without access to dense rewards, which current methods cannot address because they require multiple experience rollouts to adapt. To achieve single-episode transfer in a family of environments with related dynamics, the authors propose a general algorithm that jointly optimizes a probe and an inference model to rapidly estimate the latent variables underlying the test dynamics, which are then immediately used as input to a universal control policy. This modular approach can integrate state-of-the-art algorithms for variational inference or RL. Moreover, the approach does not require rewards at test time, allowing it to operate in settings where existing adaptation methods cannot. In diverse experimental domains with a single-episode test constraint, the method significantly outperforms existing adaptive approaches and performs well on robust transfer.

43.《Model-Augmented Actor-Critic: Backpropagating through Paths》 Keywords: model-based, actor-critic, pathwise. HIGHLIGHT: policy gradients through the learned model and Q-function via backpropagation through time. Current model-based RL methods use the model simply as a learned black-box simulator that augments the data for policy optimization or value-function learning. This paper shows how to make more effective use of the model by exploiting its differentiability. The authors construct a policy optimization algorithm that uses the pathwise derivatives of the learned model and the policy across future time steps. Instabilities arising from learning across many time steps are prevented by using a terminal value function and learning the policy in an actor-critic fashion. Furthermore, they derive monotonic improvement of the objective in terms of the gradient error of the model and value function. The method (i) is consistently more sample-efficient than existing model-based algorithms, (ii) matches the asymptotic performance of model-free algorithms, and (iii) scales to very long horizons, a regime where typical model-based approaches have difficulty.

44.《Robust Reinforcement Learning for Continuous Control with Model Misspecification》 Keywords: robustness. HIGHLIGHT: a framework for incorporating robustness to perturbations in the transition dynamics, referred to as model misspecification, into continuous-control RL algorithms. The authors specifically focus on incorporating robustness into a state-of-the-art continuous-control RL algorithm, Maximum a-posteriori Policy Optimization (MPO). They do so by learning a policy that optimizes a worst-case, entropy-regularized expected-return objective and derive a corresponding robust entropy-regularized Bellman contraction operator. In addition, they introduce a less conservative, soft-robust, entropy-regularized objective with a corresponding Bellman operator. Experiments show that both the robust and soft-robust policies outperform non-robust policies under environment perturbations across nine Mujoco domains; improved robust performance is also shown on a challenging, simulated, dexterous robotic hand. Videos: sites.google.com/view/r

45.《Black-box Off-policy Estimation for Infinite-Horizon Reinforcement Learning》 Keywords: off-policy estimation, importance sampling, propensity score. HIGHLIGHT: a novel approach to the off-policy estimation problem in infinite-horizon RL. Off-policy estimation for long-horizon problems is important in many real-world applications such as healthcare and robotics, where high-fidelity simulators may not be available and on-policy evaluation is expensive or impossible. Recently, \citet{liu18breaking} proposed an approach that avoids the curse of horizon suffered by typical importance-sampling-based methods. While promising, this approach is limited in practice because it requires the data to be collected by a known behavior policy. In this work the authors propose a novel approach that eliminates such limitations: they formulate the problem as solving for the fixed point of a "backward flow" operator and show that the fixed-point solution gives the desired importance ratios of the stationary distributions between the target and behavior policies. They analyze its asymptotic consistency and finite-sample generalization, and benchmarks confirm the effectiveness of the proposed approach.

46.《Graph Convolutional Reinforcement Learning》 Keywords: GCN, GNN. HIGHLIGHT: learning to cooperate is crucial in multi-agent environments, and the key is to understand the mutual interplay between agents. However, multi-agent environments are highly dynamic: agents keep moving and their neighborhoods change quickly, which makes it hard to learn abstract representations of the interactions between agents. To tackle these difficulties, the authors propose graph convolutional reinforcement learning, in which graph convolution adapts to the dynamics of the underlying graph of the multi-agent environment and relation kernels capture the interplay between agents through their relational representations. Latent features produced by convolutional layers with gradually increasing receptive fields are exploited to learn cooperation, and cooperation is further improved by temporal relation regularization for consistency. Code: https://github.com/PKU-AI-Edge/DGN/
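A toy sketch of one such graph-convolution step over agents follows; the shapes, the radius-based adjacency, and the single dot-product attention head are assumptions for illustration, not the DGN code base.

```python
# Each agent aggregates the features of its current neighbors, so the receptive field
# grows with stacked layers even as the neighborhood changes over time.
import numpy as np

rng = np.random.default_rng(0)
n_agents, feat_dim = 6, 8

features = rng.normal(size=(n_agents, feat_dim))      # per-agent encodings
positions = rng.uniform(size=(n_agents, 2))           # used to define "neighbors"

# Adjacency: connect agents within a radius (self-loops included), recomputed each step.
dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
adj = (dist < 0.5).astype(float)

# One relation-kernel-style layer: dot-product attention restricted to neighbors.
W_q, W_k, W_v = (rng.normal(size=(feat_dim, feat_dim)) for _ in range(3))
scores = (features @ W_q) @ (features @ W_k).T / np.sqrt(feat_dim)
scores = np.where(adj > 0, scores, -1e9)               # mask out non-neighbors
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
next_features = attn @ (features @ W_v)                 # aggregated neighbor information
print(next_features.shape)                              # (n_agents, feat_dim)
```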
47.《Thinking While Moving: Deep Reinforcement Learning with Concurrent Control》 (Poster) Keywords: continuous-time, robotics. HIGHLIGHT: an RL formulation that allows agents to think and act simultaneously, demonstrated on real-robot grasping. The RL setting in this paper is as follows: the agent must sample actions from its policy while the controlled system continues to evolve in time; for example, a robot must decide on its next action before the previous one has completed (thinking while moving). To develop an algorithmic framework for such concurrent control problems, the authors start from a continuous-time formulation of the Bellman equations and then discretize it in a way that is aware of system delays. They instantiate this new class of approximate dynamic programming methods through a simple architectural extension of existing value-based deep RL algorithms.

48.《Evolutionary Population Curriculum for Scaling Multi-Agent Reinforcement Learning》 Keywords: evolutionary learning, curriculum learning. HIGHLIGHT: in multi-agent games the complexity of the environment can grow exponentially with the number of agents, so learning good policies is particularly challenging when agents are numerous. The paper introduces Evolutionary Population Curriculum (EPC), a curriculum-learning paradigm that scales up multi-agent RL (MARL) by progressively increasing the number of agents during training. Furthermore, EPC uses an evolutionary approach to fix an objective misalignment across the curriculum: agents successfully trained at an early stage with a small population are not necessarily the best candidates for adapting to a later stage with a larger population. Concretely, at each stage EPC maintains multiple sets of agents, performs mix-and-match and fine-tuning over these sets, and promotes the best-adapted agents to the next stage. EPC is implemented on top of a popular MARL algorithm, MADDPG, and empirically shown to consistently outperform the baselines as the number of agents grows exponentially. Source code and videos: https://sites.google.com/view/epciclr2020. Code: https://github.com/qian18long/epciclr2020

49.《A Simple Randomization Technique for Generalization in Deep Reinforcement Learning》 Keywords: generalization in visual domains. HIGHLIGHT: a simple randomization technique for improving the generalization of deep RL on tasks with various unseen visual patterns. DRL agents often fail to generalize to unseen environments, especially when trained on high-dimensional state spaces such as images. The paper proposes a simple technique to improve the generalization ability of deep RL agents by introducing a randomized (convolutional) neural network that randomly perturbs the input observations. By learning robust features that are invariant across varied and randomized environments, the method enables trained agents to adapt to new domains. In addition, an inference method based on Monte Carlo approximation is used to reduce the variance induced by the randomization. The method is shown to be superior on 2D CoinRun, 3D DeepMind Lab exploration, and 3D robotics control tasks, significantly outperforming various regularization and data augmentation methods. Code: https://github.com/pokaxpoka/netrand
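Here is a simplified stand-in for that randomization idea; the paper re-draws a randomly initialized convolutional layer, whereas this sketch uses a random 1x1 channel-mixing convolution re-sampled every episode, with shapes and names chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_random_layer(n_channels=3):
    # He-style initialization of a 1x1 convolution (i.e., a channel-mixing matrix).
    return rng.normal(scale=np.sqrt(2.0 / n_channels), size=(n_channels, n_channels))

def randomize_observation(obs_chw, weight):
    # obs_chw: float array of shape (C, H, W); mix channels at every pixel.
    return np.einsum("oc,chw->ohw", weight, obs_chw)

obs = rng.random((3, 64, 64))           # a fake RGB observation
w = resample_random_layer()             # re-sampled at the start of each episode
perturbed = randomize_observation(obs, w)
print(perturbed.shape)                  # (3, 64, 64): same shape, randomized visuals
```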
50.《Reinforced Genetic Algorithm Learning for Optimizing Computation Graphs》 Keywords: learning to optimize, combinatorial optimization, computation graphs, model parallelism, learning for systems. HIGHLIGHT: we use DRL to learn a policy that guides a genetic algorithm's search, better optimizing the execution cost of computation graphs, with improved results shown on real TensorFlow graphs. The authors propose a deep RL approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler. Unlike earlier learning-based work that trains an optimizer on the same graph it is optimizing, they propose a learning approach that trains the optimizer offline and then generalizes to previously unseen graphs without further training. This allows the method to produce high-quality decisions on real-world TensorFlow graphs in seconds rather than hours. Two optimization tasks are considered: minimizing running time and minimizing peak memory usage. On both tasks, the approach achieves significant improvements over classical methods and other learning-based approaches.

51.《Projection Based Constrained Policy Optimization》 Keywords: safe reinforcement learning, constrained RL. HIGHLIGHT: a new algorithm that learns constraint-satisfying policies, with theoretical analysis and empirical demonstration in the constrained RL setting. The authors consider the problem of learning control policies that optimize a reward function while satisfying constraints regarding safety, fairness, or other costs. They propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO), an iterative method that optimizes the policy in a two-step process: the first step performs an unconstrained update, and the second step reconciles constraint violations by projecting the policy back onto the constraint set. PCPO is analyzed theoretically with a lower bound on reward improvement and an upper bound on constraint violation for each policy update; the convergence of PCPO with the projection is further characterized under two different metrics, the L2 norm and the Kullback-Leibler divergence. Empirical results on several control tasks show superior performance: on average, the algorithm incurs more than 3.5 times fewer constraint violations and achieves about 15% higher reward than state-of-the-art methods. Code: https://sites.google.com/view/iclr2020-pcpo
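A minimal sketch of PCPO's second step under the L2 metric follows; the notation (g for the cost gradient at the current policy parameters, c for the current constraint violation) is assumed here, and the paper also gives a KL-metric variant that is omitted.

```python
# After an unconstrained reward-improving update, project the parameters back onto the
# linearized constraint set {theta : g^T (theta - theta_k) + c <= 0}.
import numpy as np

def l2_project(theta_new, theta_k, g, c):
    """Project theta_new onto the half-space defined by the linearized cost constraint.

    If the constraint is already satisfied, the update is returned unchanged.
    """
    violation = g @ (theta_new - theta_k) + c
    if violation <= 0:
        return theta_new
    return theta_new - (violation / (g @ g)) * g

# Tiny usage example with made-up numbers.
theta_k = np.zeros(3)
theta_unconstrained = np.array([0.5, -0.2, 0.1])       # step that improves reward
g = np.array([1.0, 0.0, 0.0])                          # cost increases along the first axis
c = 0.2                                                # already 0.2 over the cost limit
print(l2_project(theta_unconstrained, theta_k, g, c))  # first coordinate pulled back
```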
52.《Infinite-Horizon Differentiable Model Predictive Control》 Keywords: model predictive control, Riccati equation, imitation learning, safe learning. HIGHLIGHT: a differentiable linear-quadratic model predictive control (MPC) framework for safe imitation learning. An infinite-horizon cost is enforced using a terminal cost function obtained from the discrete-time algebraic Riccati equation (DARE), so the learned controller is provably stabilizing in closed loop. One of the paper's central contributions is the derivation of the analytical derivative of the DARE solution, which allows the use of differentiation-based learning methods. Another contribution is the structure of the MPC optimization problem: (1) an augmented Lagrangian method ensures the MPC optimization remains feasible throughout training while enforcing hard constraints on states and inputs, and (2) a pre-stabilizing controller ensures that the MPC solution and its derivatives are accurate at every iteration. The learning capabilities of the framework are demonstrated in a set of numerical studies.

53.《Toward Evaluating Robustness of Deep Reinforcement Learning with Continuous Control》 Keywords: deep learning, robustness, adversarial examples. HIGHLIGHT: we study continuous-control agents in deep RL under adversarial attacks and propose a two-step algorithm based on learned model dynamics. DRL has achieved great success on many previously difficult RL tasks, yet recent studies show that, much like DNNs in classification tasks, DRL agents are also vulnerable to adversarial perturbations. Prior work focused mainly on model-free adversarial attacks and on agents with discrete actions. In this work, the authors study continuous-control DRL agents under adversarial attacks and propose the first two-step algorithm based on learned model dynamics. Extensive experiments on various MuJoCo domains (Cartpole, Fish, Walker, Humanoid) show that the proposed framework is much more effective than model-free attack baselines at degrading agent performance and at driving agents into unsafe states.

54.《Meta-learning curiosity algorithms》 Keywords: meta-learning, exploration, curiosity. HIGHLIGHT: meta-learning curiosity algorithms by searching through a rich space of programs yields novel designs that generalize across very different reinforcement learning domains. The authors hypothesize that curiosity is a mechanism discovered by evolution that encourages meaningful exploration early in an agent's life, so that it is exposed to experiences that yield high reward over its lifetime. They formulate the problem of generating curious behavior as one of meta-learning: an outer loop searches over a space of curiosity mechanisms that dynamically adapt the agent's reward signal, and an inner loop performs standard RL using the adapted reward signal. However, current meta-RL methods based on transferring neural network weights only generalize between very similar tasks. To broaden generalization, they instead meta-learn algorithms: pieces of code similar to those humans design in ML papers, expressed in a rich program language that combines neural networks with other building blocks such as buffers, nearest-neighbor modules, and custom loss functions. They demonstrate the effectiveness of this approach empirically, finding two novel curiosity algorithms that perform on par with or better than human-designed, published curiosity algorithms (experiments: grid navigation with image inputs, acrobot, lunar lander, ant, and hopper). Code: https://github.com/mfranzs/meta-learning-curiosity-algorithms

55.《Keep Doing What Worked: Behavior Modelling Priors for Offline Reinforcement Learning》 Keywords: off-policy, multitask, continuous control. HIGHLIGHT: we develop a method for stable offline RL from logged data; the key is to regularize the RL policy toward a learned "advantage-weighted" model of the data. Off-policy RL algorithms hold the promise of being applicable in settings where only a fixed dataset (batch) of environment interactions is available and no new experience can be acquired, a property that makes them appealing for real-world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. The paper proposes a simple solution: it admits data generated by arbitrary behavior policies and uses a learned prior, the advantage-weighted behavior model (ABM), to bias the RL policy toward actions that have previously been executed and are likely to be successful on the new task. The method can be seen as an extension of recent batch-RL work that enables stable learning from conflicting data sources. Experiments include multi-task learning on a real-world robot.

56.《Model-based reinforcement learning for biological sequence design》 Keywords: blackbox optimization, molecule design. HIGHLIGHT: we augment model-free policy learning with a sequence-level reward function and a count-based visitation bonus, and demonstrate effectiveness in the large-batch, low-round regime seen when designing DNA and protein sequences. Designing biological structures such as DNA or proteins involves a challenging black-box optimization problem characterized by large batches and few rounds, because evaluation requires labor-intensive wet-lab experiments. To this end, the authors propose using reinforcement learning (RL) based on proximal policy optimization (PPO) for biological sequence design. RL provides a flexible framework for optimizing generative sequence models to achieve specific criteria, such as diversity among the high-quality sequences discovered. They propose a model-based variant of PPO, DyNA-PPO, to improve sample efficiency: the policy for a new round is trained offline on a simulator fitted to functional measurements from prior rounds. To accommodate the growing number of observations across rounds, the algorithm automatically selects the simulator model in each round from a pool of models of varying capacity. On the tasks of designing DNA transcription-factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structure, DyNA-PPO performs significantly better than existing methods in settings where modeling is feasible, and performs no worse when a reliable model cannot be learned.

57.《Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies》 Keywords: meta reinforcement learning, subtask graph. HIGHLIGHT: a novel meta-RL method that infers the latent subtask structure. The authors propose and address a novel few-shot RL problem in which a task is characterized by a subtask graph describing a set of subtasks and their dependencies, which are unknown to the agent. The agent needs to quickly adapt to the task over a few episodes during the adaptation phase in order to maximize the return in the test phase. Instead of directly learning a meta-policy, they develop a Meta-learner with Subtask Graph Inference (MSGI), which infers the latent parameters of the task by interacting with the environment and maximizes the return given those inferred parameters. To facilitate learning, an intrinsic reward inspired by upper confidence bound (UCB) is adopted to encourage efficient exploration. Experimental results on two grid-world domains and a StarCraft II environment show that the new method accurately infers the latent task parameters.

58.《Never Give Up: Learning Directed Exploration Strategies》 Keywords: exploration, intrinsic motivation. HIGHLIGHT: an RL agent that solves hard exploration games by learning a range of directed exploratory policies. The authors construct an episodic-memory-based intrinsic reward, using k-nearest neighbors over the agent's recent experience, to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest-neighbor lookup, biasing the novelty signal toward what the agent can control. The framework of universal value function approximators is employed to simultaneously learn many directed exploration policies with the same neural network, each with a different trade-off between exploration and exploitation. By using the same network for different degrees of exploration/exploitation, transfer from predominantly exploratory policies to effective exploitative policies is demonstrated. The proposed method can be incorporated into modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. The method doubles the performance of the base agent on all hard-exploration games in the Atari-57 suite while maintaining very high scores on the remaining games. Notably, it is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game Pitfall! without using demonstrations or hand-crafted features.
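A simplified version of the episodic part of that intrinsic reward is sketched below; the class name, kernel constants, and use of raw embeddings are assumptions, and the paper's learned inverse-dynamics embedding and long-term (lifelong) novelty modulation are omitted.

```python
# Store embeddings of states visited in the current episode and reward the agent
# inversely to how close the new state is to its k nearest neighbours in that memory.
import numpy as np

class EpisodicNoveltyBonus:
    def __init__(self, k=10, eps=1e-3):
        self.k, self.eps = k, eps
        self.memory = []                       # embeddings seen so far this episode

    def reset(self):
        self.memory.clear()

    def bonus(self, embedding):
        embedding = np.asarray(embedding, dtype=float)
        if self.memory:
            dists = np.linalg.norm(np.stack(self.memory) - embedding, axis=1)
            knn = np.sort(dists)[: self.k]
            # Kernel on squared distances: similarity is small when the state is far
            # from everything seen, making the bonus large.
            similarity = np.sum(self.eps / (knn ** 2 + self.eps))
            r_episodic = 1.0 / np.sqrt(similarity + 1e-8)
        else:
            r_episodic = 1.0                   # first state of the episode is maximally novel
        self.memory.append(embedding)
        return r_episodic

rng = np.random.default_rng(0)
nb = EpisodicNoveltyBonus()
print([round(nb.bonus(rng.normal(size=4)), 3) for _ in range(5)])
```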
59.《Discriminative Particle Filter Reinforcement Learning for Complex Partial Observations》 Keywords: partial observability, differentiable particle filtering. HIGHLIGHT: DPFRL, a framework for reinforcement learning under partial and complex observations with an importance-weighted particle filter. DRL has been successful in decision making for complex games such as Atari and Go. However, real-world decision making often requires reasoning with partial information extracted from complex visual observations. The paper introduces Discriminative Particle Filter Reinforcement Learning (DPFRL), a new RL framework for complex partial observations. DPFRL encodes a differentiable particle filter within the neural network policy for explicit reasoning with partial observations over time. The particle filter maintains a belief using learned discriminative update rules that are trained end-to-end for decision making. Experiments show that using the differentiable discriminative update instead of a standard generative model significantly improves performance, especially for tasks with complex visual observations, because it avoids the difficulty of modeling observations that are irrelevant to decision making. In addition, to extract features from the particle belief, a new type of belief feature based on moment generating functions is proposed. DPFRL outperforms state-of-the-art POMDP RL models on the existing POMDP RL benchmark of "Natural Flickering Atari" games, and it also performs well on visual navigation with real-world data in the Habitat environment.

60.《Episodic Reinforcement Learning with Associative Memory》 Keywords: episodic control, episodic memory, associative memory, non-parametric method, sample efficiency. HIGHLIGHT: sample efficiency has been one of the major challenges for deep RL. Researchers have proposed non-parametric episodic control to speed up parametric RL by rapidly latching onto previously successful policies. However, previous work on episodic RL neglects the relationships between states and stores experience as unrelated items. To improve the sample efficiency of RL, the authors propose a novel framework, Episodic Reinforcement Learning with Associative Memory (ERLAM), which associates related experience trajectories to enable reasoning about effective policies. A graph is built on top of the states in memory based on state transitions, and a reverse-trajectory propagation strategy is developed to allow values to propagate rapidly through the graph. The non-parametric associative memory is used as early guidance for a parametric RL model. Results on navigation domains and Atari games show that the framework achieves significantly higher sample efficiency than state-of-the-art episodic RL models.

61.《Sub-policy Adaptation for Hierarchical Reinforcement Learning》 Keywords: hierarchical reinforcement learning, transfer, skill discovery. HIGHLIGHT: HiPPO, a stable hierarchical RL algorithm that can jointly train all levels of the hierarchy, performing well both for skill discovery and for adaptation. Hierarchical RL is a promising approach to long-horizon decision-making problems with sparse rewards. Unfortunately, most methods still decouple the lower-level skill-acquisition process from training the higher level that controls the skills in a new task; leaving the skills frozen can lead to significant sub-optimality in the transfer setting. In this work, the authors propose a novel algorithm that discovers a set of skills and continues adapting them, together with the higher levels, even while training on a new task. Main contributions: first, a new unbiased latent-dependent baseline is derived for the hierarchical policy gradient, and Hierarchical Proximal Policy Optimization (HiPPO) is introduced, an on-policy method to efficiently train all levels of the hierarchy jointly. Second, a method for training time-abstractions is proposed that improves the robustness of the acquired skills to environment changes. Code and videos: https://sites.google.com/view/hippo-rl. Code: https://anonymous.4open.science/r/de105a6d-8f8b-405e-b90a-54ab74adcb17/

This article is intended for academic exchange only; it does not imply that this account endorses the views expressed or vouches for the accuracy of the content. Copyright belongs to the original authors. If there is any infringement, please let us know and the content will be removed.







