Wrapping Up
Aug 9, 2018 15:23 · 1120 words · 6 minute read
As the final GSoC week is about to come to a close, I decided to wrap things up, and finally leave the gsoc18-cython
branch in a usable state. This involved pushing minor local changes that had remained uncommitted over the course of the project, from experimental branches, reverting temporary edits already pushed to the remote repository, and thorough testing and blogging.
_evidence
There remains one significant edit that I have not been able to complete. This regards the _evidence
variable from the MRF
class, that I have written about previously. Cythonising this variable has turned out to be very tricky, and I have realised the final scope of the problem only now. Without explaining how I came to these realisations over the last month, let me assert the changes still required to be made.
- Type the
_evidence
variable to be anarray
of double precision floats instead of a list. - Replace all instances of
None
in the list with-1
, a special number being used to denote aNone
(unassigned) truth value. - Correspondingly, with
None
semantics modified, restore functionality by modifying conditional expressions, loops, assignments, etc. This is a very nontrivial step, and extremely difficult to debug. - It now becomes possible to make significant changes to the
truth
functions, by typing the parameters as arrays, and return values as floats. - These changes completely redefine the internal PracMLN API - learning and inference modules use
truth
functions expecting non-float,None
returns, and can’t handle the new -1 semantics. All of them need to be appropriately rewritten now!
In all my attempts to go through these steps, I invariably get stuck at number 3. The tricky bit here is that errors generated due to this propogate and are revealed in unrelated locations - exceptions being raised due to truth values being +2, for example. These are near impossible to mitigate, since it is difficult to tell how/when/where the value became +2.
I have been trying to work on this problem, but am currently stuck at step 3, because I didn’t execute step 2 properly. My work towards this end is on the evidence-truth branch. Hopefully this will eventually get merged into gsoc18-cython.
Enumeration-Ask
Apart from the changes mentioned here already, there was more (albeit limited) scope for optimisation. I made changes to exact.pyx
directly, slightly modifying the code to avoid repetitive computation.
For instance, I changed the soft_evidence_formula
from a traversing an entire list
def soft_evidence_formula(self, gf):
truths = [a.truth(self.mrf.evidence) for a in gf.gndatoms()]
if None in truths:
return False
return isinstance(self.mrf.mln.logic, FirstOrderLogic) and any([t in Interval('(0,1)') for t in truths])
to traversing part of an array.
cpdef bint soft_evidence_formula(self, gf):
cdef array.array truths = array.array('d', [-1] * len(gf.gndatoms()))
cdef int i = 0
for i, a in enumerate(gf.gndatoms()):
result = a.truth(self.mrf.evidence)
if result is None:
return False
truths[i] = result
return isinstance(self.mrf.mln.logic, FirstOrderLogic) and any([ ( t>0 and t<1 ) for t in truths])
Interval
I also noticed a significant amount of time being spent in the Interval
class (from util.pyx
). While the class itself couldn’t be optimised, it was easy to reduce its usage. I imagine that in the future, as more PracMLN algorithms are developed, they could be initially written in Python, using code such as Interval( ... )
, but could later by optimised via Cython, with the usage of such code reduced drastically.
For instance, from the example above, notice t in Interval('(0,1)')
has been replaced with ( t>0 and t<1 )
, reducing function call overhead, and the regex (r'(\(|\[|\])([-+]?\d*\.\d+|\d+),([-+]?\d*\.\d+|\d+)(\)|\]|\[)'
) parsing inside Interval
drastically. In the same file, exact.pyx
, I replaced if truth in Interval(']0,1['):
with if truth > 0 and truth < 1:
; with identical semantics.
Special Methods
I had mentioned earlier how various special methods had a lot of speedup potential. While the Cython documentation doesn’t allow for much optimisation of these, I managed to make some progress on this front.
It turns out that in the case of PracMLN, the major usage of these was __eq__
, in the logic module, which in turn used __str__
to test the equality of GroundAtom
instances.
def __str__(self):
return "%s(%s)" % (self.predname, ",".join(self.args))
def __eq__(self, other):
return str(self) == str(other)
This could be refactored so the __eq__
directly tested self.predname
and self.args
. This led to significant improvement.
def __eq__(self, other):
return (self.predname == other.predname) and (self.args == other.args)
Grounding
Lastly, I noticed that the itergroundings
function was significantly time consuming. I wanted to define this using cdef
, however the usage of yield
doesn’t allow this, despite Cython apparently adding support for this Python feature.
Compiling default.pyx because it changed.
[1/1] Cythonizing default.pyx
Error compiling Cython file:
------------------------------------------------------------
...
self._cacheinit()
cdef int counter = -1
while True:
counter += 1
if self.iscached and len(self._cache) > counter:
yield self._cache[counter]
^
------------------------------------------------------------
default.pyx:110:16: 'yield' not supported here
Error compiling Cython file:
------------------------------------------------------------
...
except StopIteration:
self.__cachecomplete = True
return
else:
self._cache.append(gf)
yield gf
^
------------------------------------------------------------
default.pyx:119:20: 'yield' not supported here
Traceback (most recent call last):
File "setup.py", line 5, in <module>
ext_modules=cythonize("*.pyx", compiler_directives={'profile': True})
File "/home/kaivalya/ ... /lib/python3.5/site-packages/Cython/Build/Dependencies.py", line 1026, in cythonize
cythonize_one(*args)
File "/home/kaivalya/ ... /lib/python3.5/site-packages/Cython/Build/Dependencies.py", line 1146, in cythonize_one
raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: default.pyx
However, I converted (trivially) the DefaultGroundingFactory
into an extension type anyways. I also recall, in one of my early discussions with Daniel, that optimising the grounding module would eventually lead to a significant speedup. Many variables were typed in this class, however I think that if the cacheing semantics could be modified to avoid the use of None
, further speedup could be obtained by typing _cache
as an array.
cdef class DefaultGroundingFactory():
cdef public MRF mrf
cdef public list _cache
cdef int _cachesize
cdef int total_gf
cdef bint __cacheinit
cdef bint __cachecomplete
cdef dict _params
cdef dict __dict__
Miscellaneous Edits
I also made some miscellaneous edits over the course of this last week, more for the sake of consistency than actually measurable performance gain. For examle, I typed remaining MRF
attributes _variables
, _gndatoms
and _gndatoms_indices
, and redefining the gndatoms(self, l)
functions in common.pyx
with cpdef
. I wanted to optimise this further, but was limited by similar problems to those described for the truth functions in the previous blog post.
Conclusion
With these changes, my work on the gsoc18-cython
branch comes to a close. To make the code ready to work out of the box, I removed the profiling related compiler directives, and the [Cython]
print-upon-import statement I had inserted earlier. I hope to continue returning to this from time to time, with fresh ideas.
If you are a PracMLN user, I would urge you to checkout the gsoc18-cython
branch, and use it! I welcome any feedback, error-reports, or advice regarding further improvement. Please feel free to comment below this post to reach out to me. Cheers!