Wrapping Up

Aug 9, 2018 15:23 · 1120 words · 6 minute read

As the final GSoC week is about to come to a close, I decided to wrap things up, and finally leave the gsoc18-cython branch in a usable state. This involved pushing minor local changes that had remained uncommitted over the course of the project, from experimental branches, reverting temporary edits already pushed to the remote repository, and thorough testing and blogging.

_evidence

There remains one significant edit that I have not been able to complete. This regards the _evidence variable from the MRF class, that I have written about previously. Cythonising this variable has turned out to be very tricky, and I have realised the final scope of the problem only now. Without explaining how I came to these realisations over the last month, let me assert the changes still required to be made.

Type the _evidence variable to be an array of double precision floats instead of a list.
Replace all instances of None in the list with -1, a special number being used to denote a None (unassigned) truth value.
Correspondingly, with None semantics modified, restore functionality by modifying conditional expressions, loops, assignments, etc. This is a very nontrivial step, and extremely difficult to debug.
It now becomes possible to make significant changes to the truth functions, by typing the parameters as arrays, and return values as floats.
These changes completely redefine the internal PracMLN API - learning and inference modules use truth functions expecting non-float, None returns, and can’t handle the new -1 semantics. All of them need to be appropriately rewritten now!

In all my attempts to go through these steps, I invariably get stuck at number 3. The tricky bit here is that errors generated due to this propogate and are revealed in unrelated locations - exceptions being raised due to truth values being +2, for example. These are near impossible to mitigate, since it is difficult to tell how/when/where the value became +2.

I have been trying to work on this problem, but am currently stuck at step 3, because I didn’t execute step 2 properly. My work towards this end is on the evidence-truth branch. Hopefully this will eventually get merged into gsoc18-cython.

Enumeration-Ask

Apart from the changes mentioned here already, there was more (albeit limited) scope for optimisation. I made changes to exact.pyx directly, slightly modifying the code to avoid repetitive computation.

For instance, I changed the soft_evidence_formula from a traversing an entire list

def soft_evidence_formula(self, gf):
    truths = [a.truth(self.mrf.evidence) for a in gf.gndatoms()]
    if None in truths:
        return False
    return isinstance(self.mrf.mln.logic, FirstOrderLogic) and any([t in Interval('(0,1)') for t in truths])

to traversing part of an array.

cpdef bint soft_evidence_formula(self, gf):
    cdef array.array truths = array.array('d', [-1] * len(gf.gndatoms()))
    cdef int i = 0
    for i, a in enumerate(gf.gndatoms()):
        result = a.truth(self.mrf.evidence)
        if result is None:
            return False
        truths[i] = result
    return isinstance(self.mrf.mln.logic, FirstOrderLogic) and any([ ( t>0 and t<1 ) for t in truths])

Interval

I also noticed a significant amount of time being spent in the Interval class (from util.pyx). While the class itself couldn’t be optimised, it was easy to reduce its usage. I imagine that in the future, as more PracMLN algorithms are developed, they could be initially written in Python, using code such as Interval( ... ), but could later by optimised via Cython, with the usage of such code reduced drastically.

For instance, from the example above, notice t in Interval('(0,1)') has been replaced with ( t>0 and t<1 ), reducing function call overhead, and the regex (r'(\(|\[|\])([-+]?\d*\.\d+|\d+),([-+]?\d*\.\d+|\d+)(\)|\]|\[)') parsing inside Interval drastically. In the same file, exact.pyx, I replaced if truth in Interval(']0,1['): with if truth > 0 and truth < 1:; with identical semantics.

Special Methods

I had mentioned earlier how various special methods had a lot of speedup potential. While the Cython documentation doesn’t allow for much optimisation of these, I managed to make some progress on this front.

It turns out that in the case of PracMLN, the major usage of these was __eq__, in the logic module, which in turn used __str__ to test the equality of GroundAtom instances.

def __str__(self):
    return "%s(%s)" % (self.predname, ",".join(self.args))

def __eq__(self, other):
    return str(self) == str(other)

This could be refactored so the __eq__ directly tested self.predname and self.args. This led to significant improvement.

def __eq__(self, other):
    return (self.predname == other.predname) and (self.args == other.args)

Grounding

Lastly, I noticed that the itergroundings function was significantly time consuming. I wanted to define this using cdef, however the usage of yield doesn’t allow this, despite Cython apparently adding support for this Python feature.

Compiling default.pyx because it changed.
[1/1] Cythonizing default.pyx

Error compiling Cython file:
------------------------------------------------------------
...
            self._cacheinit()
        cdef int counter = -1
        while True:
            counter += 1
            if self.iscached and len(self._cache) > counter:
                yield self._cache[counter]
               ^
------------------------------------------------------------

default.pyx:110:16: 'yield' not supported here

Error compiling Cython file:
------------------------------------------------------------
...
                except StopIteration:
                    self.__cachecomplete = True
                    return
                else:
                    self._cache.append(gf)
                    yield gf
                   ^
------------------------------------------------------------

default.pyx:119:20: 'yield' not supported here
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    ext_modules=cythonize("*.pyx", compiler_directives={'profile': True})
  File "/home/kaivalya/ ... /lib/python3.5/site-packages/Cython/Build/Dependencies.py", line 1026, in cythonize
    cythonize_one(*args)
  File "/home/kaivalya/ ... /lib/python3.5/site-packages/Cython/Build/Dependencies.py", line 1146, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: default.pyx

However, I converted (trivially) the DefaultGroundingFactory into an extension type anyways. I also recall, in one of my early discussions with Daniel, that optimising the grounding module would eventually lead to a significant speedup. Many variables were typed in this class, however I think that if the cacheing semantics could be modified to avoid the use of None, further speedup could be obtained by typing _cache as an array.

cdef class DefaultGroundingFactory():
    cdef public MRF mrf
    cdef public list _cache
    cdef int _cachesize
    cdef int total_gf
    cdef bint __cacheinit
    cdef bint __cachecomplete
    cdef dict _params
    cdef dict __dict__

Miscellaneous Edits

I also made some miscellaneous edits over the course of this last week, more for the sake of consistency than actually measurable performance gain. For examle, I typed remaining MRF attributes _variables, _gndatoms and _gndatoms_indices, and redefining the gndatoms(self, l) functions in common.pyx with cpdef. I wanted to optimise this further, but was limited by similar problems to those described for the truth functions in the previous blog post.

Conclusion

With these changes, my work on the gsoc18-cython branch comes to a close. To make the code ready to work out of the box, I removed the profiling related compiler directives, and the [Cython] print-upon-import statement I had inserted earlier. I hope to continue returning to this from time to time, with fresh ideas.

If you are a PracMLN user, I would urge you to checkout the gsoc18-cython branch, and use it! I welcome any feedback, error-reports, or advice regarding further improvement. Please feel free to comment below this post to reach out to me. Cheers!