Before you start working on complex multi-machine architectures, I’d first try simply increasing the number of workers and threads for your app, e.g.:
$ gunicorn --workers 6 --threads 2 app:server
There is a longer discussion about this in the thread "Celery integration?", including the case of moving the CPU-intensive work to a separate process (Celery), which could then run on a different machine as well.
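If you end up tuning these numbers a lot, the worker and thread counts from the gunicorn command above can also live in a config file. This is only a rough sketch: the file name `gunicorn.conf.py`, the bind address, and the worker formula are my assumptions, not something your app requires, so adjust them to your machine.

```python
# gunicorn.conf.py (assumed file name; place it next to your app module)
import multiprocessing

# Common starting point from the gunicorn docs: (2 x CPU cores) + 1 workers.
# This is a rule of thumb, not a guarantee; measure under your real load.
workers = multiprocessing.cpu_count() * 2 + 1

# Threads per worker. With a value above 1, gunicorn switches from the
# default sync worker to the threaded (gthread) worker.
threads = 2

# Address to listen on (assumed; this matches gunicorn's default).
bind = "127.0.0.1:8000"
```

You would then start the app with `gunicorn -c gunicorn.conf.py app:server`. Keeping the settings in a file makes it easier to try different worker/thread combinations without editing the start command each time.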