First -- profile! Figure out where your code is actually spending its time so you can make informed choices about which areas to target for any optimization work.
- Scalene: "Python Performance Matters" by Emery Berger (Strange Loop 2022) mentions quite a few different profilers. It's by one of the Scalene developers, so it leans heavily toward Scalene.
- SnakeViz is useful for reading profile results.
- https://stackoverflow.com/questions/4544784/how-can-you-get-the-call-tree-with-python-profilers lists some more profiling tools, including pyinstrument.
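For a quick start without any third-party tools, the stdlib's cProfile plus pstats is enough; a minimal sketch:

```python
import cProfile
import io
import pstats

def work():
    # Stand-in for the code you actually want to profile
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Sort by cumulative time and show the top 5 entries.
# profiler.dump_stats("out.prof") would write a file SnakeViz can read.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Running a whole script with `python -m cProfile -o out.prof your_script.py` and then `snakeviz out.prof` gets you the same data in SnakeViz's browser view.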
Profile memory usage too. In CS we used to talk about tradeoffs between space (memory) and time: you could reduce time by using more space, and vice versa. On modern computer architectures this is often no longer true, and improving (reducing) memory usage can also improve speed.
- Some information about profiling memory usage: https://pythonspeed.com/datascience/#measuring-memory-usage
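The linked article covers several tools; the stdlib also ships tracemalloc, which is enough for a first look at allocations:

```python
import tracemalloc

tracemalloc.start()
data = [i * 2 for i in range(100_000)]  # allocate something measurable
current, peak = tracemalloc.get_traced_memory()  # bytes now / high-water mark
tracemalloc.stop()
print(f"current={current} bytes, peak={peak} bytes")
```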
These all have the advantage of being smaller than standard classes, reducing memory usage and potentially improving speed.
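The items "these" refers to appear to be missing from the note, but the usual compact alternatives to standard classes are tuples, namedtuples, and classes with `__slots__`. A sketch of the `__slots__` approach (class names here are illustrative):

```python
import sys

class Point:
    # Regular class: every instance carries a per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

class SlottedPoint:
    # __slots__ replaces the __dict__ with fixed attribute storage
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x, self.y = x, y

p, s = Point(1, 2), SlottedPoint(1, 2)
# The regular instance pays for a dict on top of the object itself:
total_regular = sys.getsizeof(p) + sys.getsizeof(p.__dict__)
total_slotted = sys.getsizeof(s)  # no __dict__ at all
print(total_regular, total_slotted)
```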
If profiling indicates that your algorithm is slow, there are a few things you can do.
- Do a web search to see whether anyone has solved your problem before, and whether they approached it differently
- Rethink your algorithm, making sure it's not accidentally quadratic
- Use a simpler but "worse" algorithm
Regarding that last tip: surprisingly, a "worse" algorithm is sometimes faster for small input sizes. For example, insertion sort beats asymptotically better sorts on very small arrays, which is why library sorts like Timsort fall back to it for short runs.
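As an illustration of "accidentally quadratic": deduplicating with a list membership test rescans the list on every check (O(n) per check, O(n^2) overall), while a set makes each check O(1) on average. A sketch:

```python
def dedupe_quadratic(items):
    # `x not in seen` scans the whole list each time -- accidentally quadratic
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return seen

def dedupe_linear(items):
    # set membership is O(1) on average, so the whole pass is O(n)
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out
```

Both return the same result; only the running time differs, and the gap grows with input size.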
- Nice python interfaces for a mix of C, C++, and Fortran
- Useful if your app is I/O-bound -- data input and output is a major cause of poor performance
- Spin up multiple python processes, avoiding the GIL
- Offers a nice wrapper around multiprocessing and threading
These different ways of performing multiple tasks at once each have their place. Split your data up properly to get the most out of multithreading and multiprocessing.
- CRADLE anecdote about splitting multiprocessing data into fewer big chunks
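The wrapper mentioned above is presumably `concurrent.futures`; its executors share one API, so switching between threads and processes is mostly a one-line change. A sketch using threads:

```python
from concurrent.futures import ThreadPoolExecutor

def task(n):
    # stand-in for an I/O-bound unit of work (e.g. a download)
    return n * n

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(10)))
print(results)

# For CPU-bound work, swap in ProcessPoolExecutor (guarded by
# `if __name__ == "__main__":`) and pass chunksize= to map() so each
# worker grabs a bigger batch -- the "fewer big chunks" idea above.
```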
- PyPy JITs python, plus lots of other optimizations.
PyPy won't always be a win; it depends a lot on the situation. See https://www.pypy.org/features.html for more information about appropriate use cases. Note also that PyPy's supported Python versions lag behind CPython's latest releases.
- Cython: compiles python, or a very python-like language, to a C module
- mypyc: compiles type-annotated python to a C module
- Numba: JITs python to machine code
- Taichi: JITs python to machine code
A lot of these improvements are based on speeding up loops of code, numeric operations, and operations on arrays of numbers. They can perform transformations on code to make use of GPUs and CPU-specific features like SIMD.
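These compilers all shine on exactly this shape of code. The pure-python hot loop below runs as-is; assuming Numba is installed, uncommenting the decorator (and passing NumPy arrays rather than lists) compiles it to machine code on first call:

```python
# With Numba available, uncomment these two lines to JIT-compile dot():
# from numba import njit
# @njit
def dot(a, b):
    # Tight numeric loop: the kind of code Numba/Cython/mypyc speed up most
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```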
Incidentally, while neither HARDAC nor DASH/SOM-HPC supports GPUs, DCC does! https://dcc.duke.edu/dcc/slurm/#gpu-jobs -- All the GPUs are Nvidia, so they support CUDA and OpenCL programming interfaces. It might be interesting to try out numba/taichi on DCC on a GPU node.
Using a new language means there are even more ways to speed up code.
- Julia
- Go
- Rust (integrate with python using PyO3 and maturin)
- I did this for data filtering in the portal
- Nim
- Java
- C#
- Or any other .Net language
- C/C++ (https://docs.python.org/3/extending/index.html#extending-index, https://docs.python.org/3/c-api/index.html)
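Writing a full extension module via the C API linked above is one route; for calling existing C code, the stdlib's ctypes is a lighter-weight option. A sketch, assuming a Unix-like system where the running process already links libc:

```python
import ctypes

# CDLL(None) opens the running process itself, which exposes libc
# symbols on Unix-like systems -- no extension module or compiler needed.
libc = ctypes.CDLL(None)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int
print(libc.abs(-7))
```

Declaring `argtypes`/`restype` up front lets ctypes check and convert arguments instead of silently passing the wrong types.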