1 00:00:00,000 --> 00:00:12,010 *36C3 preroll music* 2 00:00:12,010 --> 00:00:22,720 Andre Klapper: Alright, thank you. Thanks for your interest. I'm Andre, I'm with the 3 00:00:22,720 --> 00:00:28,130 Wikimedia Foundation, and one of the things I'm currently trying to find out is 4 00:00:28,130 --> 00:00:37,090 how to measure activity, people in our technical communities. And you probably 5 00:00:37,090 --> 00:00:42,020 know that Wikimedia is a large, large project. There's like more than 900 6 00:00:42,020 --> 00:00:47,680 websites, and there's many areas where you can contribute, technically, in different 7 00:00:47,680 --> 00:00:53,330 ways. And we're currently trying to get an overview. And even that is hard. 8 00:00:53,330 --> 00:01:02,280 So, it is a complex task. And in this talk, I would like to quickly show you what we already 9 00:01:02,280 --> 00:01:08,220 have in place, and what we want to get in place, and maybe also little bits of the 10 00:01:08,220 --> 00:01:14,030 problems and the complexity. So, it's more like, for your interest, or if you're 11 00:01:14,030 --> 00:01:24,260 curious also to play with technical metrics, statistics, things like these. 12 00:01:24,260 --> 00:01:30,830 What we have currently is mostly about git repositories, code repositories, and 13 00:01:30,830 --> 00:01:35,030 we mostly use Gerrit for code review. We have our own Gerrit instance at 14 00:01:35,030 --> 00:01:43,320 gerrit.wikimedia.org. And for this we've been having a platform called 15 00:01:43,320 --> 00:01:52,070 wikimedia.biterg.io. If you've seen an ElasticSearch, Kibana, standard platform 16 00:01:52,070 --> 00:01:58,979 thingy, this might be familiar to you. It is all Free and Open Source, it's actually 17 00:01:58,979 --> 00:02:03,259 a Linux Foundation project, you can find it under chaoss.community, chaoss with 18 00:02:03,259 --> 00:02:09,399 double s, and the code base is public on GitHub. 
So any other free and open source 19 00:02:09,399 --> 00:02:14,859 software project can also set this up for themselves. We have it hosted by Bitergia, 20 00:02:14,859 --> 00:02:19,019 but it is also possible to set this up yourself, if you're interested in 21 00:02:19,019 --> 00:02:27,150 gathering statistics about your Free and Open Source project. And there's also a 22 00:02:27,150 --> 00:02:36,269 documentation page on MediaWiki.org which is called community metrics. I think I 23 00:02:36,269 --> 00:02:40,959 have screenshots here, because I never trust the Internet at conferences, but I 24 00:02:40,959 --> 00:02:47,319 could also show you live… so this is the GitHub page of the chaoss project by the 25 00:02:47,319 --> 00:02:55,010 Linux Foundation where you could get the code. This is, I hope the zoom is 26 00:02:55,010 --> 00:03:03,699 sufficient, wikimedia.biterg.io. So this is the overview page. You can see the 27 00:03:03,699 --> 00:03:12,790 navigation up here, and you get some basic statistics about the most active people in 28 00:03:12,790 --> 00:03:18,260 the git repositories, which organizations we have, so here you can see Wikimedia 29 00:03:18,260 --> 00:03:26,080 Foundation, individuals, Hallo Welt, Wikimedia Deutschland. So these are, this 30 00:03:26,080 --> 00:03:31,619 is the contributor base we have, by organization, by affiliation. And down 31 00:03:31,619 --> 00:03:37,620 here there's way more statistics, git, Gerrit, mailing lists, we index a lot of 32 00:03:37,620 --> 00:03:43,230 things. We also index a little bit our issue tracking system, which is 33 00:03:43,230 --> 00:03:51,469 Phabricator, and some edits on MediaWiki.org. 
And, for example, now, if I 34 00:03:51,469 --> 00:03:58,999 go to Gerrit and the overview page, because we use Gerrit for code review, 35 00:03:58,999 --> 00:04:06,109 they have more specific statistics, and as it's ElasticSearch, Kibana based, you 36 00:04:06,109 --> 00:04:09,930 might know this if you've played with this, whenever you click on a certain 37 00:04:09,930 --> 00:04:15,029 value, you can filter by that value. So, for example, if I use the pie chart here, 38 00:04:15,029 --> 00:04:19,590 and only want to see the numbers for independent volunteer contributors, 39 00:04:19,590 --> 00:04:26,400 I click it, and you see the numbers now change. Obviously a bit lower, and you see 40 00:04:26,400 --> 00:04:30,530 up here, that a filter has been applied, and you can continue with these things. 41 00:04:30,530 --> 00:04:36,250 Then you can go filter here also via code repository, for example, the MediaWiki 42 00:04:36,250 --> 00:04:42,500 core repository. If I click on that one, it also filters for the value, and you can 43 00:04:42,500 --> 00:04:49,510 basically drill down the statistics you want to gather here. And there's, as I 44 00:04:49,510 --> 00:04:53,871 only have 15 minutes, there's way more things you can find out here, also, for 45 00:04:53,871 --> 00:05:02,600 example, who reviews patches in Gerrit, how long patches have been open, median 46 00:05:02,600 --> 00:05:08,870 time, all these things you might want to gather to find out how well are we doing 47 00:05:08,870 --> 00:05:15,540 as a project, when it comes to both involving volunteers, and also give them 48 00:05:15,540 --> 00:05:21,350 the feedback when it comes to code review, and engagement, that you would like to 49 00:05:21,350 --> 00:05:26,470 give. Or, also, areas for improvement. 
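[Editor's sketch: the click-to-filter drill-down described above corresponds, roughly, to adding one exact-match filter per clicked value to the search that the dashboard runs. The snippet below builds such a search body using the standard Elasticsearch bool/filter/term query structure. The field names `author_org_name` and `repository` and their values are assumptions made up for illustration; they are not taken from the talk or from the actual index.]

```python
def build_filter_query(filters):
    """Return a search body that keeps only documents matching every filter."""
    return {
        "query": {
            "bool": {
                # One exact-match "term" clause per clicked value.
                "filter": [
                    {"term": {field: value}} for field, value in filters.items()
                ]
            }
        }
    }


# Hypothetical field names and values, standing in for two clicks:
# first on the "independent" pie slice, then on one repository.
query = build_filter_query({
    "author_org_name": "Independent",
    "repository": "mediawiki/core",
})
```

[Each further click would add one more `term` clause, which is why the numbers can only stay the same or go down as filters are stacked.]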
For example, in Wikimedia Foundation obviously 50 00:05:26,470 --> 00:05:33,100 we have engineering teams, and some of them maintain certain code repositories, 51 00:05:33,100 --> 00:05:39,261 so you can filter the view for certain code repositories, and then see, for 52 00:05:39,261 --> 00:05:44,640 example, you realize sometimes that patches written by volunteers, it takes 53 00:05:44,640 --> 00:05:49,130 longer to review them than patches written by your coworkers. And these kinds of 54 00:05:49,130 --> 00:05:54,180 things which you maybe already assumed, but it's nice to have actually data. 55 00:05:54,180 --> 00:06:02,810 There's also a few caveats here. So, for example, I usually don't use the git 56 00:06:02,810 --> 00:06:10,310 statistics, because Gerrit is where the code review happens. And once a patch 57 00:06:10,310 --> 00:06:15,430 proposed in Gerrit has been accepted and merged in the git repository, you would 58 00:06:15,430 --> 00:06:20,700 also see that in the git repository, but as all our software is Open Source, Free 59 00:06:20,700 --> 00:06:26,420 Software, we also of course pull in a lot of git repositories from other upstream 60 00:06:26,420 --> 00:06:31,020 projects, because we use a lot of software invented and maintained somewhere else to 61 00:06:31,020 --> 00:06:38,550 run our servers. So the git statistics also include activity that we've imported 62 00:06:38,550 --> 00:06:43,790 within the git repositories from other companies. So, that's kind of misleading. 63 00:06:43,790 --> 00:06:48,820 And there's a few more caveats, which are actually, I hope all of them are listed on 64 00:06:48,820 --> 00:06:54,350 the community metrics page on MediaWiki.org, because at some point I had 65 00:06:54,350 --> 00:07:01,230 to create a section "behavior that might surprise you". 
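[Editor's sketch: the comparison mentioned above, review time for volunteer patches versus patches from coworkers, comes down to a median per affiliation. The snippet below shows that arithmetic on a handful of invented rows; the affiliation labels and the hour values are hypothetical sample data, not figures from the dashboard.]

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample rows: (affiliation, hours until the patch was reviewed).
reviews = [
    ("Wikimedia Foundation", 4.0),
    ("Wikimedia Foundation", 10.0),
    ("Wikimedia Foundation", 6.0),
    ("Independent", 30.0),
    ("Independent", 12.0),
    ("Independent", 50.0),
]


def median_review_hours(rows):
    """Group review times by affiliation and return the median of each group."""
    by_affiliation = defaultdict(list)
    for affiliation, hours in rows:
        by_affiliation[affiliation].append(hours)
    return {name: median(values) for name, values in by_affiliation.items()}


result = median_review_hours(reviews)
```

[A median is used rather than a mean so that a few patches left open for a very long time do not dominate the comparison.]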
That page also has 66 00:07:01,230 --> 00:07:05,820 some examples like, how can I, for the most common questions I get from 67 00:07:05,820 --> 00:07:12,820 interested people, and also co-workers, or, you want to publish an annual report, 68 00:07:12,820 --> 00:07:16,300 and show how many volunteer contributors you have in the code bases and these 69 00:07:16,300 --> 00:07:27,870 things. So that is what we have. These were the screenshots in case the Wi-Fi 70 00:07:27,870 --> 00:07:35,990 doesn't work. And now the section, what is patchwork. A spoiler: Basically everything 71 00:07:35,990 --> 00:07:43,120 else. Because this was the look at git and git repositories and Gerrit for code 72 00:07:43,120 --> 00:07:49,480 review. But there is way more going on when it comes to technical contributions 73 00:07:49,480 --> 00:07:58,590 and code in Wikimedia. There is GitHub. So, we have some projects, quite a few, 74 00:07:58,590 --> 00:08:02,461 that don't use Wikimedia git, Wikimedia Gerrit, but they prefer GitHub, because 75 00:08:02,461 --> 00:08:10,860 it's a different contribution system or workflow. So, we already track some of 76 00:08:10,860 --> 00:08:15,840 that, but we still have to improve even finding a way how to find all the 77 00:08:15,840 --> 00:08:20,100 repositories related to Wikimedia development on GitHub. Because they're not 78 00:08:20,100 --> 00:08:27,090 all under the same organization. When it comes to what I just showed you, 79 00:08:27,090 --> 00:08:33,650 wikimedia.biterg.io, we define what is being indexed in a public JSON file, 80 00:08:33,650 --> 00:08:38,409 "projects". So, this is also linked from the community metrics page on 81 00:08:38,409 --> 00:08:43,379 MediaWiki.org, where we define basically what gets indexed. And it's a long 82 00:08:43,379 --> 00:08:50,579 list as you can see, also some mailing lists, but there's a lot of code 83 00:08:50,579 --> 00:08:57,149 actually on the Wikis. Inside of Wiki pages. 
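[Editor's sketch: to get a feel for how long the list in such a "projects" JSON file is, one can count the indexed locations per data source. The layout assumed here, a project name mapping to data-source names, each holding a list of locations, is an assumption for illustration, and the sample content is invented; the real file may be structured differently.]

```python
import json

# Invented sample in the assumed layout: project -> data source -> locations.
SAMPLE = """
{
  "example-project": {
    "git": ["https://example.org/r/repo-a", "https://example.org/r/repo-b"],
    "gerrit": ["gerrit.example.org"],
    "github": ["https://github.com/example-org/repo-c"]
  },
  "another-project": {
    "git": ["https://example.org/r/repo-d"]
  }
}
"""


def count_sources(projects):
    """Count how many locations are listed for each data source."""
    counts = {}
    for sources in projects.values():
        for source, locations in sources.items():
            counts[source] = counts.get(source, 0) + len(locations)
    return counts


counts = count_sources(json.loads(SAMPLE))
```

[Anything that is not listed in such a file, for example code that lives inside wiki pages, does not show up in the dashboard at all.]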
So, there are user scripts, there 84 00:08:57,149 --> 00:09:02,830 are gadgets, like small JavaScript things that enhance functionality, and they're 85 00:09:02,830 --> 00:09:08,759 actually quite common. So, for example, Wikimedia Commons, or English or German 86 00:09:08,759 --> 00:09:15,059 Wikipedia, they have a lot of gadgets even enabled by default, which makes some 87 00:09:15,059 --> 00:09:22,279 behavior easier. For example, on Commons a common gadget is adding a category to a 88 00:09:22,279 --> 00:09:26,640 photo or image that has been uploaded. That's way easier if you use a gadget 89 00:09:26,640 --> 00:09:34,240 which is enabled by default. There are Lua modules, and there's templates. For 90 00:09:34,240 --> 00:09:39,241 example the info boxes that you see in many Wikipedia articles on the side, for 91 00:09:39,241 --> 00:09:43,839 example, if you look up a Wikipedia article about a person. These are all 92 00:09:43,839 --> 00:09:51,009 templates. And they're all stored on Wiki. So, this is harder to track, to get a full 93 00:09:51,009 --> 00:10:00,079 overview of that. And some extension code, even we have about 130 MediaWiki 94 00:10:00,079 --> 00:10:06,449 extensions deployed on Wikimedia servers. But if you take a look only at the 95 00:10:06,449 --> 00:10:11,860 extension home pages on MediaWiki.org, there are more than 2000. So there's a lot 96 00:10:11,860 --> 00:10:16,100 of code out there, and sometimes this code is even stored just by copy and paste 97 00:10:16,100 --> 00:10:20,510 putting it on a Wiki page, and saying: here, copy and paste this, and it should 98 00:10:20,510 --> 00:10:26,720 work. Which might not be the best revision system when it comes to maintaining code, 99 00:10:26,720 --> 00:10:33,139 ever, but it's a quick and dirty way, so these things exist. And one other example, 100 00:10:33,139 --> 00:10:40,199 unknown code repository locations. We also have something called ToolForge. 
That's 101 00:10:40,199 --> 00:10:44,920 what some people call "cloud services" nowadays. So you can host your own little 102 00:10:44,920 --> 00:10:50,579 helper tools which other people then can also use, on a cloud services platform 103 00:10:50,579 --> 00:10:55,069 called ToolForge that we offer. One example would be, for example, page views. 104 00:10:55,069 --> 00:11:02,770 So, if you want to see which pages are the most popular on some Wiki, that's one 105 00:11:02,770 --> 00:11:08,319 example out of, also thousands of tools now actually. And though, of course, the 106 00:11:08,319 --> 00:11:14,019 rules are that you must publish the source code, it's sometimes really hard to also 107 00:11:14,019 --> 00:11:18,249 make sure that this happens, and where it happens. So for most repositories, we 108 00:11:18,249 --> 00:11:23,329 know, we have an index, but for some we actually don't know, which is also 109 00:11:23,329 --> 00:11:31,790 something to work out. So, recently, even getting a number of things, or getting an 110 00:11:31,790 --> 00:11:38,790 idea, like, what can we measure, what do we have, how much do we have, I started 111 00:11:38,790 --> 00:11:43,829 to create a table, and even visualizing that was an interesting task. I'm 112 00:11:43,829 --> 00:11:49,439 still not sure if anybody understands this, but black basically means doesn't 113 00:11:49,439 --> 00:11:55,970 exist. You don't need to, there is nothing to measure, to index. Green means, yes 114 00:11:55,970 --> 00:12:02,830 we do measure this already. And yellow means, it's tricky, but 115 00:12:02,830 --> 00:12:09,459 it's kind of possible via some scripts or using the API to get numbers out of the 116 00:12:09,459 --> 00:12:15,420 Wikis, in certain namespaces, for example the Module namespace. And red means, it's 117 00:12:15,420 --> 00:12:22,600 very hard, but we'd like to get this data at some point. 
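[Editor's sketch: the "yellow" case above, getting numbers out of a wiki for one namespace via the API, could look roughly like this. It uses the MediaWiki Action API's `list=allpages` query and follows the `continue` values until the listing ends. Namespace ID 828 is the one used for Lua modules. The function names, the User-Agent string and the idea of passing in the fetch function are choices made for this sketch, not something from the talk, and this is not necessarily how the numbers in the table were produced.]

```python
import json
import urllib.parse
import urllib.request

MODULE_NAMESPACE = 828  # namespace ID used for Lua modules


def fetch_json(api_url, params):
    """Perform one GET request against a MediaWiki Action API endpoint."""
    url = api_url + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "namespace-count-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def count_pages(api_url, namespace, fetch=fetch_json):
    """Count the pages of one namespace by walking the paginated listing."""
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": namespace,
        "aplimit": "max",
        "format": "json",
    }
    total = 0
    while True:
        data = fetch(api_url, params)
        total += len(data.get("query", {}).get("allpages", []))
        if "continue" not in data:
            return total
        # Carry the continuation values over into the next request.
        params = {**params, **data["continue"]}
```

[A call such as `count_pages("https://commons.wikimedia.org/w/api.php", MODULE_NAMESPACE)` would have to be repeated for every wiki, and the result includes redirects, which is one reason why such counts are only rough.]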
Plus, also the complexity, 118 00:12:22,600 --> 00:12:28,579 so the numbers you see here are sometimes correct numbers, sometimes more of a 119 00:12:28,579 --> 00:12:34,670 ballpark vague figure about how many items, code repositories, projects we're 120 00:12:34,670 --> 00:12:39,089 actually talking about. And with some numbers, we're even wondering. For 121 00:12:39,089 --> 00:12:46,199 example, it says 270 000 modules and templates on the 900 sites, websites 122 00:12:46,199 --> 00:12:53,019 we have on Wikimedia servers, and this is what the database query says on Hive, but 123 00:12:53,019 --> 00:12:58,179 we're not really trusting that number yet. So, this is actually what we're going to 124 00:12:58,179 --> 00:13:03,139 be after over the next months to also have way better data, and a way better overview 125 00:13:03,139 --> 00:13:07,890 of where our developers actually are. Because we know, in code repositories, we 126 00:13:07,890 --> 00:13:17,209 have about 200 to 400 code contributors, in Gerrit code review, per month. 127 00:13:17,209 --> 00:13:24,480 And we now also know that we have about 500 to 600 people who work on user scripts and 128 00:13:24,480 --> 00:13:30,619 gadgets, per year. But for many other things, we don't know yet, and that's what 129 00:13:30,619 --> 00:13:36,199 I'm trying to improve over the next months, or, maybe realistically, years. 130 00:13:36,199 --> 00:13:45,299 Let's see. But, yeah. So, that's basically it. I hope this was a bit interesting. 131 00:13:45,299 --> 00:13:51,089 If you have any comments, questions, feel free to catch me here. I'm sometimes 132 00:13:51,089 --> 00:13:56,329 around the table. Feel free to catch me after this talk. 
These are links with more 133 00:13:56,329 --> 00:14:03,019 information, or, if you don't manage to catch me, feel also free on the community 134 00:14:03,019 --> 00:14:09,110 metrics page on MediaWiki.org, the first link, there is a discussion page, and 135 00:14:09,110 --> 00:14:14,939 there you can also bring up anything, ideas, ask questions, I watch that page, 136 00:14:14,939 --> 00:14:18,149 and, usually, reply. Thank you! 137 00:14:18,149 --> 00:14:21,049 *applause* 138 00:14:21,049 --> 00:14:24,809 *postroll music* 139 00:14:24,809 --> 00:14:48,000 Subtitles created by c3subtitles.de in the year 2021. Join, and help us!