Analysis of non-English communities on Steem

·@jacekw.dev·
#### Repository
https://github.com/steemit/steem

#### Introduction
Many communities on the Steem network use a language other than English. For example, I am actively involved in the #polish community through the @jacekw account. The purpose of this analysis is to find the most active communities of this type, compare them, and observe how they have changed over time.

#### Outline
* Scope of the analysis
* Tools
* Verification of initial data
* Different tags for the same languages
* Posts
* Author rewards
* Average payout per post
* Authors
* Tags with prefixes
* Conclusions
* Proof of work


#### Scope of the analysis

The data was downloaded from the [SteemSQL](https://steemsql.com/) database (table `Comments`) and covers the first 6 months of 2018. The following query was used to download the posts for a given tag (here `spanish`).
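```
-- Top-level posts (depth = 0) tagged 'spanish', created January to June 2018.
SELECT url, total_payout_value, active_votes, json_metadata, created, body_language
FROM Comments (NOLOCK) c
WHERE depth = 0 AND
      -- CONTAINS pre-filters via full-text search; LIKE then requires the exact quoted tag.
      (CONTAINS(json_metadata, 'spanish') AND json_metadata LIKE '%"spanish"%') AND
      YEAR(created) = 2018 AND
      MONTH(created) <= 6
```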
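Since the same query is repeated for every tag, it can be generated from a template. The snippet below is a minimal sketch of that idea, not the code from the original notebook: `QUERY_TEMPLATE` and `build_tag_query` are hypothetical names, and the helper only fills the tag and date range into the query shown above without executing it.

```python
import re

# Hypothetical template: the query from this post with the tag and dates parameterised.
QUERY_TEMPLATE = """
SELECT url, total_payout_value, active_votes, json_metadata, created, body_language
FROM Comments (NOLOCK) c
WHERE depth = 0 AND
      (CONTAINS(json_metadata, '{tag}') AND json_metadata LIKE '%"{tag}"%') AND
      YEAR(created) = {year} AND
      MONTH(created) <= {last_month}
"""


def build_tag_query(tag, year=2018, last_month=6):
    """Return the SQL text for one tag (hypothetical helper, runs nothing)."""
    # Accept only lowercase letters, digits and hyphens, so that the tag
    # cannot break out of the quoted SQL literals.
    if not re.fullmatch(r"[a-z0-9-]+", tag):
        raise ValueError("unexpected characters in tag: %r" % tag)
    return QUERY_TEMPLATE.format(tag=tag, year=int(year), last_month=int(last_month))


if __name__ == "__main__":
    print(build_tag_query("polish"))
```

The tag is validated before it is substituted because it ends up inside string literals in the query; a stricter or looser pattern may be appropriate depending on which tag names need to be supported.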


#### Tools
- [SteemSQL](https://steemsql.com/)
- [Python 3.6](https://www.python.org/)
	- [matplotlib](https://matplotlib.org/)
	- [matplotlib_venn](https://pypi.org/project/matplotlib-venn/)
	- [Jupyter Notebook](http://jupyter.org/)


#### Verification of initial data

Candidate language-community tags were selected manually from the 1000 most popular tags (over a 14-day period).

```
lang_tags = [
    'indonesia', 'spanish', 'aceh', 'kr', 'cervantes', 'cn', 'deutsch', 'castellano', 'venezuela',
    'tr', 'polish', 'fr', 'myanmar', 'japanese', 'ru', 'pt', 'thai', 'ua',
    'morocco', 'arab', 'pilipinas', 'steemit-austria', 'mexico', 'vn', 'rusteemteam', 'cesky', 'bangladesh',
    'russian', 'hindi', 'br', 'arabic', 'teamserbia', 'steemromania', 'teamukraine', 'filipino', 'serbia'
]
```

There are different conventions here:
- country name e.g. #indonesia, #venezuela
- name of the language used, in English e.g. #spanish, #polish
- name of the language used, in the mother tongue e.g. #cervantes, #deutsch
- two-letter code e.g. #kr, #cn
- other name e.g. #steemit-austria, #rusteemteam

In my opinion, the name of a country is not the best choice for a language-community tag, because it is also the first tag that comes to mind when someone wants to write a post about that country in English.
The tags were selected manually, so additional verification is needed, as some of them may not be related to language communities at all.

For this purpose, I used the `body_language` column from the `Comments` table. The charts below show the share of individual languages in each tag.

![](https://cdn.steemitimages.com/DQmduBEh4f9ofoF1PV5naif4JH4qnwPohAEHZieUqkoJgn8/image.png)
![](https://cdn.steemitimages.com/DQmVzBHqsxpwdFMUCkBFajPFqXSD77RCtppTHrsDkJuJhNS/image.png)
![](https://cdn.steemitimages.com/DQmdK1pxM6HHpUiEnkSLtHzmhJXD2cZGKHGQvottzK4nfEF/image.png)
![](https://cdn.steemitimages.com/DQmcuf8YZPMy5rojXoktBeZuR9HGMNMXRE97kEidxZ3yz1U/image.png)

We can see that some tags are dominated by English. Such tags will be omitted from further analysis. Since we are not sure to what extent `body_language` can be trusted, we set the threshold quite low: a tag is kept if at least 30% of its posts are in the community's language.

.|Tag|Ratio (%)|Lang
-|-|-|-
1|castellano|94.1|es
2|br|92.5|pt
3|cervantes|89.4|es
4|kr|87.4|ko
5|thai|87.3|th
6|spanish|87.0|es
7|teamukraine|85.5|uk
8|venezuela|84.7|es
9|pt|81.0|pt
10|polish|79.2|pl
11|myanmar|77.4|my
12|tr|68.7|tr
13|japanese|68.4|ja
14|deutsch|67.9|de
15|fr|65.1|fr
16|steemit-austria|63.1|de
17|rusteemteam|61.2|ru
18|cesky|59.7|cs
19|indonesia|53.2|id
20|cn|51.0|zh
21|arabic|50.6|ar
22|ru|48.3|ru
23|mexico|47.4|es
24|aceh|46.2|id
25|morocco|45.6|ar
26|arab|41.2|ar
27|ua|34.6|uk
28|pilipinas|33.1|tl
29|russian|32.9|ru
30|hindi|30.1|hi
**-**|**-**|**-**|**-**
31|filipino|19.7|tl
32|vn|17.8|vi
33|serbia|15.9|sr
34|bangladesh|15.4|bn
35|teamserbia|15.0|sr
36|steemromania|13.9|ro

The last 6 tags fall below the 30% threshold and are dropped from further analysis:
- #filipino
- #vn
- #serbia
- #bangladesh
- #teamserbia
- #steemromania
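The filtering step above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: it assumes each post has already been reduced to a single language code taken from `body_language`, and it treats the "ratio" as the share of the most common non-English language in a tag. The function names and the toy data are made up for the example.

```python
from collections import Counter


def dominant_language_share(body_languages):
    """Return (language, share in %) of the most common non-English language."""
    counts = Counter(body_languages)
    total = sum(counts.values())
    non_english = {lang: n for lang, n in counts.items() if lang != "en"}
    if total == 0 or not non_english:
        return None, 0.0
    lang, n = max(non_english.items(), key=lambda item: item[1])
    return lang, 100.0 * n / total


def filter_tags(tag_languages, threshold=30.0):
    """Keep only the tags whose dominant non-English language reaches the threshold."""
    kept = {}
    for tag, languages in tag_languages.items():
        lang, share = dominant_language_share(languages)
        if share >= threshold:
            kept[tag] = (lang, round(share, 1))
    return kept


if __name__ == "__main__":
    # Toy data, not taken from SteemSQL: one language code per post.
    toy = {
        "polish": ["pl"] * 8 + ["en"] * 2,
        "serbia": ["en"] * 8 + ["sr"] * 2,
    }
    print(filter_tags(toy))
```

With the toy data, the first tag is kept (80% of its posts are in Polish) and the second is dropped (only 20% are in Serbian), which mirrors the cut made in the table above.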

#### Different tags for the same languages

Several tags turn out to share the same language. Let's see how these tags relate to each other and how many posts they have in common.

---
```
es : castellano, cervantes, spanish, venezuela, mexico
pt : br, pt
uk : teamukraine, ua
de : deutsch, steemit-austria
ru : rusteemteam, ru, russian
id : indonesia, aceh
ar : arabic, morocco, arab
```
---
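```
import matplotlib.pyplot as plt
from matplotlib_venn import venn2, venn3

# tag_urls_dict (tag -> set of post URLs) and same_lang_dict
# (language code -> list of tags) are defined elsewhere and not shown here.
def plot_venn(tags):
    """Draw a 2- or 3-set Venn diagram of the posts shared between tags."""
    plt.figure(figsize=(6, 6))
    fn = venn2 if len(tags) == 2 else venn3
    fn([tag_urls_dict[tag] for tag in tags], ['#' + t for t in tags])
    plt.show()

# Languages with 2 or 3 tags fit into a single diagram.
for lang_code, tags in same_lang_dict.items():
    if 2 <= len(tags) <= 3:
        plot_venn(tags)

# Spanish has 5 tags, so it is split into two 3-set diagrams.
plot_venn(['spanish', 'cervantes', 'castellano'])
plot_venn(['spanish', 'mexico', 'venezuela'])
```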

##### pt : #br, #pt

<center>
![](https://cdn.steemitimages.com/DQmbEkZGzLKa4Le8GRjSnPHKjptJrW8hAzveuUhbjBVXRBg/image.png)
</center>

##### uk : #teamukraine, #ua

<center>
![](https://cdn.steemitimages.com/DQmXpjENfyQaiYzwi3xt6XGNVQJhDAGmE41Gs8fJJGi7QS6/image.png)
</center>

##### de : #deutsch, #steemit-austria

<center>
![](https://cdn.steemitimages.com/DQmaCcqBSYNMAWG4R6bRJQxSfZbpNzUdcCCZUZyPeYboQ1r/image.png)
</center>

##### ru : #rusteemteam, #ru, #russian

<center>
![](https://cdn.steemitimages.com/DQmPKNa4MjaUA8RZGCimghJeSKs9bpjmDFhgxX3YeeL4vm6/image.png)
</center>

##### id : #indonesia, #aceh

<center>
![](https://cdn.steemitimages.com/DQmNrrN7gFAyZfz8dg1V86YELTj7NwCPYivnbkbzcA4vfCd/image.png)
</center>

##### ar : #arabic, #morocco, #arab

<center>
![](https://cdn.steemitimages.com/DQmdZDgkwpNHu2rFAL1nduSZj2cbBAQtQ7kQjb9DMqN6Fuo/image.png)
</center>

##### es : #spanish, #cervantes, #castellano

<center>
![](https://cdn.steemitimages.com/DQmdjTyVx5kwAR7HEE87o7bMm56PHguGthzYBCrVmAWYrwL/image.png)
</center>

##### es : #spanish, #mexico, #venezuela

<center>
![](https://cdn.steemitimages.com/DQmQkkuWnCozuxwKSFdr3urQtQkKpukBNP6TjnFvopmDATV/image.png)
</center>

The #teamukraine tag is almost completely contained in #ua, so it is going to be omitted from further analysis, especially since both tags concern the Ukrainian community. The situation looks similar for #pt and #br, but #br seems to be a separate (Brazilian) community.
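The "contained in" relationship read off the Venn diagrams can also be expressed as a number. The sketch below is an illustration only: `containment` and `jaccard` are hypothetical helper names, and the URL sets are toy data rather than the real contents of the tags.

```python
def containment(a, b):
    """Fraction of the posts in set a that also appear in set b."""
    return len(a & b) / len(a) if a else 0.0


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy sets of post URLs, made up for the example.
    small_tag = {"/p/1", "/p/2", "/p/3", "/p/4"}
    large_tag = {"/p/1", "/p/2", "/p/3", "/p/5", "/p/6", "/p/7", "/p/8", "/p/9"}
    print("containment:", containment(small_tag, large_tag))
    print("jaccard:", jaccard(small_tag, large_tag))
```

Containment is asymmetric, which suits this case: a small tag can be almost entirely inside a large one even though the large tag has many posts of its own, something a symmetric measure such as the Jaccard index hides.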

#### Posts

Let's see how many posts were added in each tag.

.|Tag|Posts
-|--|-
1|indonesia|300522
2|spanish|282685
3|aceh|258482
4|kr|242179
5|cervantes|158545
6|cn|74577
7|deutsch|71291
8|castellano|61437
9|venezuela|58809
10|tr|57838
11|polish|29744
12|myanmar|20385
13|fr|20350
14|japanese|19592
15|ru|18445
16|pt|16875
17|thai|16681
18|ua|14414
19|morocco|8851
20|arab|6444
21|pilipinas|6389
22|steemit-austria|6386
23|mexico|4643
24|rusteemteam|4615
25|cesky|3956
26|russian|2432
27|teamukraine|2133
28|hindi|1990
29|br|1326
30|arabic|1240

![](https://cdn.steemitimages.com/DQmdm1JH2sTvoyMa4rn112q7huTQvQ2JBF9Z43Uj2LDFcSj/image.png)
![](https://cdn.steemitimages.com/DQmXdG68t9L29t6e4kVUiUEgbRujLpUQq5WoSvAhnATGYtu/image.png)
![](https://cdn.steemitimages.com/DQmUYPdEdadnY41YSPzXE7ifW5WCyj1kNHcm8CAndkfaJiG/image.png)

There is great diversity here: the largest tag (#indonesia, 300522 posts) has roughly 240 times as many posts as the smallest (#arabic, 1240 posts). The Spanish-language tags rank near the top of the list, which may indicate that Steem is well known in South America.

#### Author rewards

However, the number of posts alone is an insufficient indicator: it is very easy to spam a tag with low-value posts, and such posts say nothing about the tag's real popularity. Therefore, let's also look at the sum of rewards in individual tags.

.|Tag|Author rewards
-|--|-
1|kr|342070
2|spanish|166190
3|cn|158431
4|cervantes|105259
5|deutsch|86543
6|indonesia|81085
7|tr|63973
8|aceh|42160
9|fr|36596
10|castellano|35293
11|japanese|26803
12|pt|22386
13|myanmar|19214
14|polish|19053
15|ru|18731
16|venezuela|17815
17|steemit-austria|13182
18|ua|12459
19|thai|12041
20|morocco|9888
21|mexico|8887
22|arab|6835
23|br|4579
24|cesky|4246
25|pilipinas|3428
26|rusteemteam|2364
27|hindi|1325
28|russian|718
29|teamukraine|519
30|arabic|414

![](https://cdn.steemitimages.com/DQmbScz3HoFvxQEAkdv5xd2wLG39RKqyDGRSsj9iEdPCkU2/image.png)
![](https://cdn.steemitimages.com/DQmagoxSMQF6FjTSEqLZVoh4rXeZ2X68d2TYzBKi7LwjPWo/image.png)
![](https://cdn.steemitimages.com/DQmf3AnfgxfNWWWhk5iZZzwFSpFuWcRcEijGgepQjWkHLwo/image.png)

The table looks quite similar to the previous one. It is also worth looking at the average reward per post in a given tag, which tells us how rich the community is.


#### Average payout per post

.|Tag|Average author rewards per post
-|--|-
1|br|3.453
2|cn|2.124
3|steemit-austria|2.064
4|mexico|1.914
5|fr|1.798
6|kr|1.412
7|japanese|1.368
8|pt|1.327
9|deutsch|1.214
10|morocco|1.117
11|tr|1.106
12|cesky|1.073
13|arab|1.061
14|ru|1.016
15|myanmar|0.943
16|ua|0.864
17|thai|0.722
18|hindi|0.666
19|cervantes|0.664
20|polish|0.641
21|spanish|0.588
22|castellano|0.574
23|pilipinas|0.537
24|rusteemteam|0.512
25|arabic|0.334
26|venezuela|0.303
27|russian|0.295
28|indonesia|0.270
29|teamukraine|0.243
30|aceh|0.163

![](https://cdn.steemitimages.com/DQmYhNg7GsYdhrcvAeu6MZUMyMmj93t5ScB51xEmd3yPctS/image.png)
![](https://cdn.steemitimages.com/DQmPNjCdS8mvAjCFWC4hr4s3Y9Srorz4cqFRqsRDxR3tJFy/image.png)
![](https://cdn.steemitimages.com/DQmb6RmQCg8LnHiCEasGvqeXR5hqT8pTjQX4XhMRi3h1E8e/image.png)


We can see that the differences are significant. Keep in mind, however, that averages for tags with a small number of posts may not be reliable.
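The averages above are the author rewards of a tag divided by its number of posts. As a quick worked check, the snippet below recomputes three rows from the values listed in the Posts and Author rewards tables; the variable names are only illustrative.

```python
# Values copied from the "Posts" and "Author rewards" tables in this post.
posts = {"br": 1326, "cn": 74577, "aceh": 258482}
rewards = {"br": 4579, "cn": 158431, "aceh": 42160}

# Average author reward per post, rounded to three decimals as in the table.
averages = {tag: round(rewards[tag] / posts[tag], 3) for tag in posts}

if __name__ == "__main__":
    for tag, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
        # The post count is printed next to the average because averages
        # over few posts are less reliable.
        print(tag, avg, "(posts: %d)" % posts[tag])
```

The top-ranked #br average rests on only 1326 posts, while #aceh at the bottom has more than 250000, which illustrates the reliability caveat.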


#### Authors

We can also look at the popularity of a given tag from a different angle: the number of authors rather than the number of posts. This indicator seems even better, because it is not inflated when a single user spams a very large number of posts.

.|Tag|Authors
-|--|-
1|spanish|48312
2|indonesia|42307
3|aceh|40846
4|cervantes|31520
5|kr|25622
6|venezuela|18329
7|castellano|15121
8|cn|9226
9|deutsch|9109
10|tr|6931
11|polish|4785
12|japanese|3118
13|fr|2976
14|myanmar|2851
15|ru|2083
16|thai|2060
17|mexico|1998
18|pt|1829
19|arab|1058
20|morocco|1039
21|steemit-austria|880
22|ua|854
23|pilipinas|846
24|russian|796
25|hindi|638
26|rusteemteam|458
27|arabic|441
28|cesky|369
29|br|291
30|teamukraine|81


Now we see how small some communities are! And if a community does not exceed a certain threshold, its members may prefer to post in English, which makes the growth of such a community more difficult.

![](https://cdn.steemitimages.com/DQmP7f3BXvhy52aL53V398stBpZevQ7PZJmQ6JviMXrKCSV/image.png)
![](https://cdn.steemitimages.com/DQmXnM3JUsCVzZLveu28WXzHpMUUoRq2Um6rkFsSw5PVHEh/image.png)
![](https://cdn.steemitimages.com/DQmdXaQj7ef4GsyrZbBVfDksXzCAxha2UHnC8wUSqgQeWUV/image.png)
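Counting authors instead of posts can be sketched as below. The query used in this post returns `url` but no author column, so the sketch assumes that each URL has the form `/category/@author/permlink` and extracts the account name from it. That URL layout, the helper names, and the toy URLs are assumptions made for illustration.

```python
import re

# Assumed URL layout: /category/@author/permlink
AUTHOR_RE = re.compile(r"/@([a-z0-9.\-]+)/")


def author_from_url(url):
    """Return the account name embedded in a post URL, or None if absent."""
    match = AUTHOR_RE.search(url)
    return match.group(1) if match else None


def count_authors(tag_urls):
    """Map each tag to the number of distinct authors among its post URLs."""
    result = {}
    for tag, urls in tag_urls.items():
        authors = {author_from_url(url) for url in urls}
        authors.discard(None)  # ignore URLs that do not match the assumed layout
        result[tag] = len(authors)
    return result


if __name__ == "__main__":
    # Toy URLs, made up for the example.
    toy = {
        "polish": {
            "/polish/@alice/first-post",
            "/polish/@alice/second-post",
            "/polish/@bob/hello",
        },
    }
    print(count_authors(toy))
```

In the toy data, three posts by two accounts count as two authors, so one prolific account adds to the post count but not to the author count.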

#### Tags with prefixes

A common problem in language communities (I say this as someone who is actively involved in the #polish community) is how to use the other tags. If I write a post in Polish and tag it #polish #bitcoin #cryptocurrency, the last two tags will also put the post in front of English-speaking audiences, so it is not a good solution.

Another idea is to use tags in the mother tongue, but some tags are the same word in both languages, e.g. #film, so this is not a good solution either. Therefore, the Polish community decided (following the convention adopted by the Korean community) to use tags with the prefix `pl-`. This makes it possible to keep posts in Polish separate from the English-speaking audience.

Let us therefore look at which other communities use such a convention. The table below shows how many unique tags exist with each prefix.


.|Prefix|Count
-|--|-
1|kr|2302
2|pl|815
3|cn|221
4|ru|168
5|jp|64
6|de|31
7|tr|24
8|pt|12
9|fr|8
10|ua|5
11|es|5

We can see that the Korean community has the largest number of such unique tags, followed by the Polish and Chinese communities. For the Polish community, a tree of these tags can be browsed at https://steemweb.pl/categories (by @rafalski).

Let's see which tags of this type are the most popular.

.|Prefix-tag|Count
-|--|-
1|#kr-newbie|83291
2|#kr-life|23164
3|#kr-writing|19077
4|#kr-event|13164
5|#cn-reader|13133
6|#kr-daily|12263
7|#kr-art|8748
8|#kr-travel|7077
9|#kr-food|6037
10|#pl-artykuly|5987
11|#kr-pen|5678
12|#kr-coin|5494
13|#kr-join|5484
14|#kr-news|5011
15|#jp-newbie|4171
16|#kr-overseas|4055
17|#kr-diary|3812
18|#kr-gazua|3752
19|#kr-series|2980
20|#kr-funfun|2738
21|#kr-dev|2665
22|#kr-book|2601
23|#kr-youth|2563
24|#kr-story|2355
25|#cn-malaysia|1957
26|#kr-economy|1777
27|#kr-hobby|1708
28|#kr-steemit|1706
29|#kr-game|1693
30|#pl-fotografia|1612

As we can see, most of these tags (25 of the top 30) are Korean.
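Both tables in this section can be derived from the `json_metadata` column. The sketch below shows one way to do it, assuming that `json_metadata` is a JSON object with a `tags` list, which is the usual layout for posts. The function names and the sample rows are made up for illustration, and the original notebook may do this differently.

```python
import json
from collections import Counter


def extract_tags(json_metadata):
    """Return the list of tags stored in a post's json_metadata string."""
    try:
        meta = json.loads(json_metadata)
    except (TypeError, ValueError):
        return []  # missing or malformed metadata
    tags = meta.get("tags", []) if isinstance(meta, dict) else []
    if not isinstance(tags, list):
        return []
    return [tag for tag in tags if isinstance(tag, str)]


def prefixed_tag_stats(metadata_rows, prefixes):
    """Return (posts per prefixed tag, number of unique tags per prefix)."""
    usage = Counter()
    for row in metadata_rows:
        for tag in set(extract_tags(row)):  # count each tag once per post
            prefix, sep, rest = tag.partition("-")
            if sep and rest and prefix in prefixes:
                usage[tag] += 1
    unique_per_prefix = Counter(tag.partition("-")[0] for tag in usage)
    return usage, unique_per_prefix


if __name__ == "__main__":
    # Sample rows, made up for the example.
    rows = [
        '{"tags": ["kr", "kr-newbie", "kr-life"]}',
        '{"tags": ["kr-newbie", "polish", "pl-artykuly"]}',
        "not valid json",
    ]
    usage, unique = prefixed_tag_stats(rows, {"kr", "pl", "cn"})
    print(usage.most_common(3))
    print(dict(unique))
```

Restricting the count to a known set of prefixes avoids treating ordinary hyphenated tags such as #steemit-austria as prefixed tags.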

#### Conclusions

We have managed to find 30 tags used by language communities that do not use English. There are probably more, but finding them is not simple because of the different tagging conventions that have been adopted. All these communities are probably waiting for the [Communities](https://steemit.com/steemit/@steemitblog/applications-team-update-hivemind-communities-sign-ups-developer-tools-and-condenser-steemit-com) feature, because it will make it easier for them to function.

It also appears that the size of a language community is not always correlated with the number of people who speak the language or with the size of the country. An example of this is the #kr tag, which ranks high in all lists even though Korea has a relatively small population compared to e.g. China.

Some communities use almost only their mother tongue, while others have chosen to use English (e.g. #serbia). It is also worth remembering that the Russian community decided at some point to move to its own blockchain, https://golos.io, although this did not make it disappear completely from the Steem network. Interestingly, there are also some quite large countries that do not have a language community.

There are many other issues that could be explored here, such as what the community formation process looks like, who the pioneers are, and who the leaders are (if any).

#### Proof of work
[Scripts used in this work (as Jupyter Notebook)](https://github.com/jwladzinski/steem-communities/blob/master/lang.ipynb)