• Samuel Obadan
  • Zenghui Wang




Keywords

Genetic algorithm, Evolutionary Neural networks, Reinforcement Learning, Particle filters, Markov decision processes, POMDPs


Abstract

This paper introduces novel concepts for accelerating learning in an off-policy reinforcement learning algorithm for Partially Observable Markov Decision Processes (POMDPs) by leveraging a multi-agent framework. Reinforcement learning (RL) is a slow but elegant approach to learning in an unknown environment. Although action-value (Q-learning) methods converge faster than state-value methods, the rate of convergence to an optimal policy or maximum cumulative reward remains a constraint. Consequently, in an attempt to optimize the learning phase of an RL problem within a POMDP environment, we present two multi-agent learning paradigms: multi-agent off-policy reinforcement learning and a novel genetic algorithm (GA) approach to multi-agent offline learning using feedforward neural networks. At the end of training (episodes for reinforcement learning and epochs for the genetic algorithm), we compare the convergence rates of the two algorithms with respect to creating the underlying MDPs for POMDP problems. Finally, we demonstrate the impact of layered resampling in Monte Carlo particle filtering on the accuracy of belief-state estimation relative to ground truth within POMDP domains. Initial empirical results suggest practicable solutions.
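As a rough orientation to the two mechanisms the abstract names, the sketch below shows a tabular off-policy (Q-learning) backup and a systematic resampling step of the kind commonly used in Monte Carlo particle filters. This is a minimal illustration only: the state and action names, rewards, and hyperparameters (`alpha`, `gamma`) are assumed placeholders, not values or code from the paper.

```python
import random

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One off-policy (Q-learning) backup: bootstrap on the greedy
    action in the next state, regardless of the behavior policy."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def systematic_resample(particles, weights):
    """Systematic resampling: draw n particles with probability
    proportional to their weights using a single random offset."""
    n = len(particles)
    step = sum(weights) / n
    u = random.uniform(0.0, step)   # one offset, then evenly spaced picks
    out, cum, i = [], weights[0], 0
    for _ in range(n):
        while cum < u:              # advance to the particle whose
            i += 1                  # cumulative weight covers u
            cum += weights[i]
        out.append(particles[i])
        u += step
    return out
```

Resampling concentrates the particle set on high-weight hypotheses, which is the step the paper's layered-resampling idea targets for improving belief-state accuracy.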


References

S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. Presa Reyes, M.-L. Shyu, S.-C. Chen, and S. S. Iyengar, “A survey on deep learning: Algorithms, techniques, and applications,” ACM Computing Surveys (CSUR), vol. 51, issue 5, pp. 92:1–92:36, 2018.

S. Obadan, Z. Wang, “A hybrid optimization approach for complex nonlinear objective functions,” International Journal of Computing, vol. 17, issue 2, pp. 102-112, 2018.

J. Pineau, G. Gordon, and S. Thrun, “Point-based value iteration: An anytime algorithm for POMDPs,” Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2003, pp. 477–484.

N. Roy, G. Gordon, and S. Thrun, “Finding approximate POMDP solutions through belief compression,” J. Artificial Intelligence Research, vol. 23, pp. 1–40, 2005.

J. A. Bagnell and J. Schneider, “Autonomous helicopter control using reinforcement learning policy search methods,” Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Seoul, South Korea, 2001, pp. 1615-1620.

D. Silver and J. Veness, “Monte-Carlo planning in large POMDPs,” Proceedings of Neural Information Processing Systems, Vancouver, Canada, 2010, pp. 1–9.

C. Boutilier and D. Poole, “Computing optimal policies for partially observable Markov decision processes using compact representations,” Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1168-1175.

E. Hansen and Z. Feng, “Dynamic programming for POMDPs using a factored state representation,” Proceedings of the Fifth International Conference on Artificial Intelligence Planning and Scheduling (AIPS-00), Breckenridge, CO, 2000, pp. 130-139.

S. Omidshafiei, A.A. Agha-Mohammadi, C. Amato, S.Y. Liu, J.P. How and J. Vian, “Decentralized control of multi-robot partially observable Markov decision processes using belief space macro-actions,” The International Journal of Robotics Research, vol. 36, issue 2, pp. 231–258, 2017.

Z. Sunberg, and M. Kochenderfer, “POMCPOW: An online algorithm for POMDPs with continuous state, action, and observation spaces,” Proceedings of the Twenty-Eighth International Conference on Automated Planning and Scheduling, 2018, pp. 1-9.

H. Kurniawati and V. Yadav, “An online POMDP solver for uncertainty planning in dynamic environment,” Proceedings of the International Symposium of Robotics Research, 2013, pp. 611-629.

P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al., “Value-decomposition networks for cooperative multi-agent learning,” arXiv preprint arXiv:1706.05296, 2017.

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, vol. 1, Cambridge, MA: MIT Press, 1998.

P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” arXiv:1709.06560, 2017.

J. Schulman, P. Abbeel, and X. Chen, “Equivalence between policy gradients and soft Q-learning,” arXiv preprint arXiv:1704.06440, 2017.

J.-S. Gutmann and D. Fox, “An experimental comparison of localization methods continued,” Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Lausanne, Switzerland, 2002, vol. 1, pp. 454-459.

T. Patten, W. Martens, and R. Fitch, “Monte Carlo planning for active object classification,” Autonomous Robots, vol. 42, pp. 391–421, 2018. https://doi.org/10.1007/s10514-017-9626-0.

N. Cao, K. H. Low, and J. M. Dolan, “Multi-robot informative path planning for active sensing of environmental phenomena: A tale of two algorithms,” Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems AAMAS'13, May 2013, pp. 7–14.

G. Best, M. Forrai, R.R. Mettu and R. Fitch, “Planning-aware communication for decentralised multi-robot coordination,” Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 1050-1057.

M. Negnevitsky, Artificial Intelligence: a Guide to Intelligent Systems, second edition, Addison-Wesley, 2005.

D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, 1989.

J. H. Holland, K. J. Holyoak, R. E. Nisbett and P. R. Thagard, Induction: Processes of Inference, Learning, and Discovery, MIT Press, 1986.

L. Davis (ed.), Genetic Algorithms and Simulated Annealing, Morgan Kaufmann, 1987.

R. Belew, L. Booker (eds.), Genetic Algorithms, Proceedings of the Fourth International Conference, Morgan Kaufmann, 1991.

J. H. Holland, Adaptation in Natural and Artificial Systems, MIT Press, 1992.

J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in the Sciences of Complexity, Lecture Notes, Addison-Wesley Longman Publ. Co., Inc., Reading, MA, 1991.

C. Bishop, M. Svensen, and C. Williams, “GTM: The Generative Topographic Mapping,” Neural Computation, vol. 10, issue 1, pp. 215-234, 1998.

J. H. Holland, “Complex adaptive systems,” Daedalus (The MIT Press on behalf of the American Academy of Arts & Sciences), vol. 121, no. 1, pp. 17-30, Winter 1992.

J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, 1992.

P.W. Tsai, T.K. Dao, et al., “Robot path planning optimization based on multiobjective grey wolf optimizer,” Proceedings of the International Conference on Genetic and Evolutionary Computing, Springer, 2016, pp. 166–173.

E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell, “Loss is its own reward: Self-supervision for reinforcement learning,” arXiv preprint arXiv:1612.07307, 2016.

R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in Neural Information Processing Systems, vol. 12, pp. 1057–1063, 2000.

H. Li, X. Liao, and L. Carin, “Multi-task reinforcement learning in partially observable stochastic environments,” Journal of Machine Learning Research, vol. 10, pp. 1131-1186, 2009.

T. Li, H. Fan, S. Sun, “Particle filtering: Theory, approach, and application for multitarget tracking,” Acta Autom. Sin., vol. 41, pp. 1981–2002, 2015.

L. Martino, V. Elvira, G. Camps-Valls, “Group importance sampling for particle filtering and MCMC,” arXiv preprint arXiv:1704.0277, 2017.

S. Thrun, W. Burgard and D. Fox, Probabilistic Robotics, Cambridge, MA: MIT Press, 2005.

G. Best, J. Faigl and R. Fitch, “Online planning for multirobot active perception with self-organising maps,” Autonomous Robots, vol. 42, pp. 715–738, 2018. DOI: 10.1007/s10514-017-9691-4.

C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of Monte Carlo tree search methods,” IEEE Trans. on Comput. Intell. and AI in Games, vol. 4, issue 1, pp. 1–43, 2012.

M. Toussaint, “Robot trajectory optimization using approximate inference,” Proceedings of the ACM Int. Conf. on Machine Learning, 2009, pp. 1049–1056.

G. Best et al., “Dec-MCTS: Decentralized planning for multi-robot active perception,” The International Journal of Robotics Research, vol. 38, issues 2-3, pp. 316-337, 2019.

H. Li, H. Gao, T. Lv, and Y. Lu, “Deep Q-learning based dynamic resource allocation for self-powered ultra-dense networks,” Proceedings of the IEEE ICC Workshops, 2018, pp. 1–6.

N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” arXiv preprint arXiv:1810.07862, 2018.

Y. Lin, X. Dai, L. Li, and F.-Y. Wang, “An efficient deep reinforcement learning model for urban traffic control,” arXiv preprint arXiv:1808.01876, 2018.

N. Zhao, Y.-C. Liang, D. Niyato, Y. Pei, M. Wu, and Y. Jiang, “Deep reinforcement learning for user association and resource allocation in heterogeneous networks,” Proceedings of the IEEE International Conference GLOBECOM, Abu Dhabi, UAE, Dec. 2018, pp. 1–6.

J. Perolat, J. Z. Leibo, V. Zambaldi, C. Beattie, K. Tuyls, and T. Graepel, “A multi-agent reinforcement learning model of common-pool resource appropriation,” arXiv preprint arXiv:1707.06600, 2017.

P. Vadakkepat, K. C. Tan, and W. Ming-Liang, “Evolutionary artificial potential fields and their application in real time robot path planning,” Proceedings of the 2000 Congress on Evolutionary Computation, 2000, vol. 1, pp. 256–263.

W. Wang, J. Hao, Y. Wang, and M. Taylor, “Towards cooperation in sequential prisoner’s dilemmas: a deep multiagent reinforcement learning approach,” arXiv preprint arXiv:1803.00162, 2018.

T. Li, J. M. Corchado, S. Sun, H. Fan, “Multi-EAP: Extended EAP for multi-estimate extraction for SMC-PHD filter,” Chinese Journal of Aeronautics, vol. 30, pp. 368–379, 2017.

S. Obadan, Z. Wang, “A multi-objective optimization approach to robot localization of single and multiple emission sources,” Procedia Manufacturing, vol. 35, pp. 755-761, 2019.

D. Sun, F. Geißer, and B. Nebel, “Towards effective localization in dynamic environments,” Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, October 2016, pp. 4517–4523.

S. Sun, T. Li, and T. P. Sattar, “Adapting sample size in particle filters through KLD-resampling,” Electronics Letters, vol. 49, no. 12, pp. 740–742, 2013.

G. Peng, W. Zheng, Z. Lu, J. Liao, L. Hu, G. Zhang, and D. He, “An improved AMCL algorithm based on laser scanning match in a complex and unstructured environment,” Complexity, vol. 2018, article ID 2327637, 2018.




How to Cite

Obadan, S., & Wang, Z. (2020). A MULTI-AGENT APPROACH TO POMDPS USING OFF-POLICY REINFORCEMENT LEARNING AND GENETIC ALGORITHMS. International Journal of Computing, 19(3), 377-386. https://doi.org/10.47839/ijc.19.3.1887