Automating the Wikipedia jury process with C++ programs and Linux commands during the 2021 Months of African Cinema

Introduction

The AfroCine Project is a non-profit organization that seeks to improve the visibility of African cinematic content on the internet. With the support of the Wikimedia Foundation, it organizes several events and contests in pursuit of this vision. Perhaps the most consistent and successful of these is the Months of African Cinema, which has been held every year since 2018. The last edition resulted in the addition of about 4000 articles to Wikipedia. While this is good news, it also means more work for the jury team and the core team reviewers.

Every year, all entries for the AfroCine Project need to be assessed, a process that usually involves volunteers scoring each entry. However, computing the cumulative score for each user is repetitive and prone to error when done manually. This is why, in my capacity as contest coordinator and in conjunction with the head of the jury, I decided to look for ways of doing this with more accuracy and speed.

The Data

Essentially, we have a webpage that lists articles and the scores assigned to them by the language jury. These scores are based on several parameters. We are only interested in the total for each user.

afro.png

Parsing through the Webpage

There are two approaches to doing this. Which one works better depends on your preference and on the number of entries to be parsed. For very large data, it is recommended to use a command-line (CLI) browser to fetch the content of a web URL and then save the result to a text file. An example is given below:

lynx -dump https://en.wikipedia.org/wiki/Wikipedia:WikiProject_AfroCine/Months_of_African_Cinema/Users_By_Articles

Or

w3m -dump https://en.wikipedia.org/wiki/Wikipedia:WikiProject_AfroCine/Months_of_African_Cinema/Users_By_Articles

Task 1: The command above prints the result to the console. How do you save it to a file?

Identifying the total score for each article

We now have a text file that contains our content of interest. As stated earlier, the jury used three parameters to judge an entry:

Article score: one point is given for every entry
Reference score: 0.5 points are given per quality reference
Size score: one point is given for every 2000 bytes added
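
To make the arithmetic concrete, here is a minimal sketch of how the three parameters could combine into one total. The function name entryScore is hypothetical, and the sketch assumes that the size score counts only complete 2000-byte blocks; the jury may well have rounded differently.

```cpp
// Hypothetical helper: total score for one entry under the three rules above.
// Assumption: only complete 2000-byte blocks earn a size point.
double entryScore(int qualityReferences, long bytesAdded) {
    const double articleScore = 1.0;                        // one point per entry
    const double referenceScore = 0.5 * qualityReferences;  // 0.5 per quality reference
    const double sizeScore =
        static_cast<double>(bytesAdded / 2000);             // integer division drops the remainder
    return articleScore + referenceScore + sizeScore;
}
```

Under these assumptions, an entry with 4 quality references and 6500 bytes added would score 1 + 2 + 3 = 6 points.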

One observation is that these totals usually come right after the equals sign (“=”). So if we create a command that goes through the text file, looks for every equals sign and picks up whatever follows it, that value is expected to be the total for that entry. The command should also handle whitespace before the equals sign, as some jury leads formatted their scores that way.

grep -oP '=\K.*' extracting\ sum\ data\ from\ wiki.txt > extracted.txt
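
For readers who would rather do the extraction in C++ as well, the following is a rough sketch of the same idea, not the method used during the contest. The function name totalAfterEquals is hypothetical and the code assumes C++17 for std::optional. Unlike the grep command, which keeps everything after the first equals sign, this version keeps only the first number it finds there.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: returns the number that follows the first '=' on a line,
// or std::nullopt if the line has no '=' or no number right after it.
std::optional<double> totalAfterEquals(const std::string& line) {
    const std::size_t pos = line.find('=');
    if (pos == std::string::npos) {
        return std::nullopt;  // no equals sign on this line
    }
    std::istringstream rest(line.substr(pos + 1));
    double value = 0.0;
    if (rest >> value) {      // operator>> skips leading whitespace
        return value;
    }
    return std::nullopt;      // nothing numeric after the equals sign
}
```

Calling it on a made-up line such as "1 + 1.5 + 2 = 4.5" would yield 4.5, while a line without an equals sign yields no value.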

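Summing up all the totals

We now have all the individual totals in one document, but we still need to add them up for each particular user. Unfortunately, it is difficult to find a free online calculator that accepts file uploads, so we can write one ourselves. The program for this is given below. Note that it reads its input from a file named official.txt, so the extracted totals for a user should be saved under that name (or the filename in the code changed accordingly):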

#include <fstream>
#include <iostream>

using namespace std;

// Reads every number from Datain and computes the count, sum and average.
void readFile(double &sum, int &cnt, double &avg, ifstream &Datain);

// Returns true if the input file was opened successfully.
bool isFileAvailable(ifstream &Datain);

int main() {
    // Running totals for the scores read from the file
    int cnt = 0;
    double sum = 0.0;
    double avg = 0.0;

    ifstream Datain;
    Datain.open("official.txt");

    if (isFileAvailable(Datain)) {
        readFile(sum, cnt, avg, Datain);
        cout << "No of Numbers in the File: " << cnt << endl;
        cout << "Sum of Numbers in the File: " << sum << endl;
        cout << "Average of Numbers: " << avg << endl;
    } else {
        cout << "**File Not Found **" << endl;
    }
    return 0;
}

bool isFileAvailable(ifstream &Datain) {
    // The stream is in a failed state if the file could not be opened
    return !Datain.fail();
}

void readFile(double &sum, int &cnt, double &avg, ifstream &Datain) {
    double num;

    // Read whitespace-separated numbers until the end of the file
    // (reading also stops at the first token that is not a number)
    while (Datain >> num) {
        cnt++;
        sum += num;
    }
    Datain.close();

    // Avoid dividing by zero when the file contains no numbers
    avg = (cnt > 0) ? sum / cnt : 0.0;
}

One advantage of the automated approach is that it provides verification checks that help ensure data integrity. For example, the program reports how many numbers it read, which should be the same as the total number of entries for that user.

Screenshot from 2022-02-26 18-15-04.png

Future improvements

Calculating the article size score is a repetitive task that can easily be automated on-wiki. This would reduce the jury's burden to reference assessment only.

Additionally, putting everything together in a single script is an area to explore. Note that even the C++ program was run from the terminal.