Aug 6 '16

with Di (Jack) Chen, NC State

Among studies that analyze the development of open source projects in the SE field, we find that interviews and text mining are the two dominant methods used by researchers. Interviews give researchers direct feedback from people, but at a cost of time and, in some situations, money. Text mining enables researchers to process massive datasets much faster, but it has the problem of understanding natural language. With crowdsourcing, we can analyze many large datasets that are hard for computers to understand, at quite a low cost in terms of time and money.

Our experiments on Amazon Mechanical Turk have demonstrated that it is feasible for Turk workers to analyze comments on GitHub. Now, we are leveraging the advantages of crowdsourcing and machine learning to analyze thousands of pull request discussions from GitHub, which contain tens of thousands of comments. The final goal is to find out:

  • What causes intense discussion around pull requests?
  • Are participants disapproving of the solutions being proposed or of the problems being solved?
  • Do core members always agree with each other? What happens when they disagree?
  • How does audience pressure affect pull requests’ results?
  • What role does the alert/mention system play? Does it influence the final results?