Scaling OrthoLearners using Ray #793
Comments
Thanks for sharing. Would you be able to share your findings from the performance analysis?
@fverac Yes, we have done the performance analysis. We were able to run 1M units with about 500 covariates in ~7-8 minutes with the Ray-based implementation, versus more than ~40 minutes with the current implementation on an EC2 high-memory node.
This is a great achievement @v-shaal! I think if there is a way to seamlessly incorporate this Ray remote function framework, we should strongly consider it! Do you know what it would take to incorporate it into the library? Would you be willing to submit a PR with this improvement?
@fverac @vsyrgkanis, I would be glad to work on this and raise a PR. I am currently going over the current structure to figure out the best possible way to incorporate this with minimal changes to the existing code structure. Let me know if you have any suggestions.
@vsyrgkanis can you please assign this to me?
@vsyrgkanis @fverac @kbattocchi, I've raised a PR for this; kindly review and let me know your feedback.
Currently it is challenging to scale OrthoLearners to large datasets because the current implementation of `_crossfit` is sequential, which may not be efficient for large numbers of data points. To overcome this, we can use Ray remote functions (Ray tasks) to invoke each of the K folds remotely and asynchronously, running them simultaneously on separate Python workers. This can be done by simply modifying `_crossfit` in `_ortho_learner.py`.

We conducted a performance analysis of the EconML implementation of DML versus our DML_Ray version at varying scales (10k, 100k, and 1 million treated units), using approximately 500 covariates generated by the synthetic data generator API from https://github.com/py-why/dowhy/blob/main/dowhy/datasets.py.
Here's the link to the implementation of DML scaled via Ray that I have created; let me know your thoughts. A sketch of the parallelization idea follows the link. @amit-sharma @emrekiciman
https://gist.github.com/vishal-d11/cd886eb6bdff96ad5a04711cb18339ed#file-dml_ray-ipynb
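For illustration, here is a minimal sketch of the idea: each of the K cross-fitting folds is dispatched as a Ray task rather than fitted in a sequential loop. This is not EconML's actual `_crossfit` code; the `_fit_fold` helper and `crossfit_ray` wrapper are hypothetical stand-ins for the per-fold work that `_crossfit` in `_ortho_learner.py` performs.

```python
# Sketch only: parallel cross-fitting of a nuisance model with Ray tasks.
# `_fit_fold` and `crossfit_ray` are hypothetical names, not EconML APIs.
import numpy as np
import ray
from sklearn.base import clone
from sklearn.model_selection import KFold

ray.init(ignore_reinit_error=True)

@ray.remote
def _fit_fold(model, X, y, train_idx, test_idx):
    # Fit a fresh clone of the nuisance model on the training split and
    # return out-of-sample predictions for the held-out fold.
    fitted = clone(model).fit(X[train_idx], y[train_idx])
    return test_idx, fitted.predict(X[test_idx])

def crossfit_ray(model, X, y, n_splits=5):
    # Launch all K fold fits at once; each .remote call returns a future
    # immediately, so Ray can run the folds concurrently on separate
    # Python workers instead of one after another.
    folds = KFold(n_splits=n_splits).split(X)
    futures = [_fit_fold.remote(model, X, y, tr, te) for tr, te in folds]
    # ray.get blocks until every fold has finished, then we stitch the
    # out-of-sample predictions back into the original row order.
    preds = np.empty(len(y), dtype=float)
    for test_idx, fold_preds in ray.get(futures):
        preds[test_idx] = fold_preds
    return preds
```

Under these assumptions, the cross-fitted nuisance residuals used in the DML moment equation could then be formed as, e.g., `y - crossfit_ray(model_y, X, y)`, with total wall-clock time governed by the slowest fold rather than the sum of all K folds.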