A good way to begin learning assembly language is to take a small piece of time-critical C code, perhaps 20-30 lines, compile it, then disassemble it and work out how the C maps to the assembly. Draw a picture with registers, pointers and arrows, and really get to the bottom of what is going on.
Then you can start making the implementation more efficient by eliminating superfluous calculations and rewriting the C over a series of iterations to reduce the number of branches and instructions.
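As a hypothetical illustration of one such iteration (the function names are mine, not from any particular codebase), here is a loop that counts elements above a threshold, first with a data-dependent branch and then rewritten so the comparison result, which is always 0 or 1 in C, is simply added. This is a sketch: at high optimization levels gcc may already produce branch-free code for the first version, so check the disassembly and measure before and after.

```c
#include <stddef.h>

/* First iteration: straightforward loop with a branch in the body. */
size_t count_above_branchy(const int *a, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > threshold)
            count++;
    }
    return count;
}

/* Later iteration: (a[i] > threshold) evaluates to 0 or 1, so it can be
   added directly, removing the conditional jump from the loop body.
   Whether this is faster depends on the data and the compiler. */
size_t count_above_branchless(const int *a, size_t n, int threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(a[i] > threshold);
    return count;
}
```

The branchless form tends to help most when the comparison is unpredictable (for example, random data), because a mispredicted branch is costly on modern CPUs.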
With the exception of a few operations such as rotate, C has everything you need to write assembly-level code, with the advantages of being cross-platform and more readable. Every high-level C construct (for, while, array indexing) has a fairly obvious breakdown into comparisons, conditional gotos and pointer additions.
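A minimal sketch of that breakdown, using hypothetical function names: the same array sum written once with a for loop and indexing, and once lowered by hand into a pointer that is advanced each iteration plus a test and a backward goto, which is roughly the shape of the compare/jump sequence you will see in the disassembly. The last function shows a common workaround for rotate, which has no C operator; GCC and Clang typically recognize this masked-shift idiom and emit a single rotate instruction where the target has one, but that is compiler behaviour, not a language guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* High-level version: for loop with array indexing. */
long sum_for(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-lowered version: a[i] becomes a pointer advanced by one element,
   and the loop becomes an entry test, a body and a backward goto. */
long sum_goto(const int *a, size_t n)
{
    long s = 0;
    const int *p = a;
    const int *end = a + n;
    if (p == end)          /* skip the body entirely when n == 0 */
        goto done;
loop:
    s += *p;               /* load and add */
    p++;                   /* pointer add replaces the index */
    if (p != end)          /* conditional jump back to the top */
        goto loop;
done:
    return s;
}

/* Rotate left: masking both shift counts keeps the shifts in range,
   so r == 0 (and r >= 32) is well defined. */
uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << (r & 31)) | (x >> (-r & 31));
}
```

Putting the test at the bottom of the loop, as in sum_goto, means each iteration costs one conditional jump instead of a conditional jump plus an unconditional one; compilers usually make the same transformation.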
By applying this technique to the important parts of my code, I regularly get a 2-4x speedup over the same code compiled by gcc at its highest optimization level.