In the old days, data scientists were always complaining that the on-premises Hadoop cluster didn't have enough resources to run their jobs. By moving big data jobs to the AWS cloud, data scientists can launch their own EMR cluster, sized to whatever the project needs, and tag it by project so that costs stay trackable and under control. Here is a typical architecture:
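The "launch your own project-tagged cluster" part can be sketched with boto3. The parameter names below follow the real `run_job_flow` API, but the instance sizes, role names, and the `Project` tag key are assumptions for illustration, not my actual production values:

```python
# Sketch: build a run_job_flow request for a project-sized, project-tagged
# EMR cluster. Only the request dict is built here; the actual AWS call is
# shown commented out at the bottom.

def build_emr_request(project: str, core_nodes: int) -> dict:
    """Build an EMR run_job_flow request sized for one project and tagged for cost allocation."""
    return {
        "Name": f"{project}-emr",
        "ReleaseLabel": "emr-6.15.0",          # assumed EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
            ],
            # Auto-terminate when the steps finish, so idle clusters don't burn money.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        # Cost-allocation tag: everything the cluster spends is billed to the project.
        "Tags": [{"Key": "Project", "Value": project}],
    }

# To actually launch (requires AWS credentials):
# import boto3
# boto3.client("emr").run_job_flow(**build_emr_request("churn-model", core_nodes=10))
```

Because the tag propagates to the underlying EC2 instances, cost-allocation reports can then break spend down per project instead of per shared cluster.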
But of course, my real case was more complicated: I had to encrypt the data at both the column level and the file level on the client side before putting it into S3. I also wrote a custom Spot Fleet worker that decrypts that data with custom MapReduce logic, and for security reasons every EMR or other processing cluster must run in an isolated subnet with no Internet access.
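To make the column-level piece concrete, here is a minimal sketch of encrypting only the sensitive columns before upload, using Fernet (AES-based) from the `cryptography` package. This is an illustration under assumptions, not my actual implementation: the column names are hypothetical, and a real pipeline would fetch the key from KMS or an HSM instead of generating it in place.

```python
# Sketch: column-level client-side encryption before an S3 upload, and the
# inverse transform that a decryption worker would run.
from cryptography.fernet import Fernet

# Hypothetical set of sensitive columns; join keys stay readable.
SENSITIVE_COLUMNS = {"ssn", "email"}

def encrypt_record(record: dict, fernet: Fernet) -> dict:
    """Encrypt only the sensitive columns of one record."""
    return {
        col: fernet.encrypt(val.encode()).decode() if col in SENSITIVE_COLUMNS else val
        for col, val in record.items()
    }

def decrypt_record(record: dict, fernet: Fernet) -> dict:
    """Inverse transform, run inside the isolated-subnet worker."""
    return {
        col: fernet.decrypt(val.encode()).decode() if col in SENSITIVE_COLUMNS else val
        for col, val in record.items()
    }

# In production the key comes from a key-management service, not generate_key().
key = Fernet.generate_key()
f = Fernet(key)
row = {"user_id": "42", "email": "a@example.com", "ssn": "123-45-6789"}
encrypted = encrypt_record(row, f)
```

Leaving the non-sensitive columns in the clear is what lets the MapReduce jobs still partition and join on them without touching the key material.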