2 Million Password Analysis and Visualization

utopian-io·@avnigenc·7 years ago
#### Details

I analyzed about 2 million passwords (1,936,835 to be exact) and ran a general character analysis on the dataset. I found the dataset on the internet and will share it with you. I checked it myself and can say that it contains passwords from different languages. The letter frequency of the passwords, however, did not match the English letter frequency. I used the R programming language to prepare the data and to visualize it.


#### Outline

1. Scope of Analysis
2. Tools
3. Scripts


#### Scope of Analysis
##### Charts
###### Letters Analysis
First, I looked at the distribution of letters across the whole dataset, which produced the [a-z] bar chart below. The most commonly used letter is "a", which appears 1,027,141 times, counting both single and repeated occurrences within a password. The majority of Internet data is in [English](http://www.internetworldstats.com/stats7.htm). Based on this, we would expect the letter frequencies to be close to those of English.
In an English letter frequency analysis the most used letter is "e", but in my analysis the most commonly used letter is "a".

![600px-English_letter_frequency_%28alphabetic%29.svg.jpg](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516141461/kwcwbuq7ezrn4st7nqde.png)

![a-z barchart.jpeg](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516139987/wfqpquvlvslxxmxazmbn.jpg)
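
To make the letter count concrete, here is a minimal sketch of how such a tally could be computed. It is not the code from the full script: the `passwords` sample and the `count_letters` helper are hypothetical, the counting is assumed to be case-insensitive, and base R is used instead of stringr so the snippet runs without extra packages.

```r
# Hypothetical sample; the real analysis uses about 2 million passwords.
passwords <- c("password1", "Banana!", "qwerty", "Aa123456")

# Count how often each letter a-z occurs, ignoring case (an assumption).
count_letters <- function(pw) {
  # Split every password into single characters
  chars <- unlist(strsplit(tolower(pw), ""))
  # Keep only the 26 plain letters; digits and symbols are dropped
  chars <- chars[chars %in% letters]
  # Fixed factor levels keep letters that never occur, with a count of 0
  counts <- table(factor(chars, levels = letters))
  setNames(as.integer(counts), letters)
}

letter_counts <- count_letters(passwords)
letter_counts[["a"]]
```

A vector like `letter_counts` can be passed straight to `barplot()` with `names.arg = letters`, in the same way the bar chart code in the Scripts section plots its count vector.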

######  Numbers Analysis

Running the same analysis on digits shows that the most used digit is 1 (1,272,673 times) and the least used is 4 (565,775 times). Usage declines steadily from 1 to 4, then starts to fluctuate.


![0-9 barchart.jpeg](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516141626/xqfpzvofjz2y1i9ud4d4.jpg)
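
The digit tally can be sketched with a regular expression instead of splitting every password by hand. This is only an illustration under assumptions: the sample `passwords` vector is made up, and the full script may count digits differently.

```r
# Hypothetical sample passwords, for illustration only.
passwords <- c("pass1234", "2018abc", "111111", "qwerty")

# Extract every digit with a regular expression.
digits <- unlist(regmatches(passwords, gregexpr("[0-9]", passwords)))

# Tabulate the digits 0-9; fixed levels keep unused digits with a count of 0.
digit_counts <- setNames(
  as.integer(table(factor(digits, levels = as.character(0:9)))),
  as.character(0:9)
)

# Name of the most frequent digit
most_used <- names(digit_counts)[which.max(digit_counts)]
```

Note that `which.max()` returns only the first maximum, so ties would need separate handling on real data.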

######  Uniques Analysis

Another point of curiosity was the distribution of unique (special) characters. The most used one is @ (12,159 times), followed by the exclamation point (10,708 times).
PS: The dot was not included in this analysis.



![unique.jpeg](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516142144/elu5lpghrvmzbcikn8am.jpg)
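
One way to isolate these characters is to drop everything that is a letter, a digit, or a dot, and tabulate what remains. The sketch below assumes that definition of "unique character" and uses a made-up `passwords` sample. It also treats only ASCII letters as letters, so accented or non-Latin letters would be counted as special characters here.

```r
# Hypothetical sample passwords, for illustration only.
passwords <- c("p@ssw0rd", "hello!", "a.b.c", "me@home!", "x_y@z")

# Split into single characters
chars <- unlist(strsplit(passwords, ""))

# Keep anything that is not an ASCII letter, a digit, or a dot
# (the dot is excluded, as in the post).
special <- chars[!grepl("^[A-Za-z0-9.]$", chars, perl = TRUE)]

# Frequency table, most common character first
special_counts <- sort(table(special), decreasing = TRUE)
names(special_counts)[1]
```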


######  Upper-Lower Characters Analysis
We expected lowercase letters to outnumber uppercase ones, but not by such an overwhelming margin. In the letters-only analysis, uppercase letters make up only about 6% of the total (594,064 of 9,940,582 letters).

![1.jpeg](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516142807/fj3rcizl3can4qwrrozq.jpg)
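
The uppercase and lowercase shares can be sketched by stripping out all other characters and counting what is left. The `passwords` sample and the variable names are hypothetical, and only ASCII letters are considered, which is an assumption about how the original totals were obtained.

```r
# Hypothetical sample passwords, for illustration only.
passwords <- c("Password", "HELLO", "qwerty12", "aBc")

# Remove everything except the letters of interest, then count the rest.
# perl = TRUE makes the A-Z and a-z ranges independent of the locale.
upper_total <- sum(nchar(gsub("[^A-Z]", "", passwords, perl = TRUE)))
lower_total <- sum(nchar(gsub("[^a-z]", "", passwords, perl = TRUE)))
letter_total <- upper_total + lower_total

# Shares in percent, rounded to whole numbers
upper_pct <- round(100 * upper_total / letter_total)
lower_pct <- round(100 * lower_total / letter_total)
```

In this toy sample the uppercase share is much higher than in the real dataset, simply because the sample was chosen to exercise both cases.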



Screenshot of the data:
![image.png](https://res.cloudinary.com/hpiynhbhq/image/upload/v1516145385/ing4jonmmjugoaqsecw6.png)




#### Tools

I used the R programming language, the R Project environment, and a few libraries.
Libraries:
ggplot2, plotly -> used to visualize the data
stringr -> string operations

#### Scripts
I will only share short sections here because the full code is quite long. You can find the complete project and dataset on my [GitHub](https://github.com/avnigenc/letters-analysis-2mpass).

Bar chart plotting code.
```
# The count vectors (harf_degerleri, rakam_degerleri, uniq) and their
# label vectors are built earlier in the full script.
# Bar chart for letters [a-z]
barplot(harf_degerleri, xlab="Characters",ylab = "Values", main="[a-z] barchart", names.arg = chars, col="orange", ylim=c(1,1200000))
# Bar chart for digits [0-9]
barplot(rakam_degerleri, xlab="Numbers",ylab = "Values", main="[0-9] barchart", names.arg = numbers, col="pink", ylim=c(1,1500000))
# Bar chart for unique (special) characters
barplot(uniq, xlab="Unique chars.",ylab = "Values", main="[unique] barchart", names.arg = uniq_names, col="grey", ylim=c(1,15000))
# Uppercase letter total -> 594064, lowercase letter total -> 9346518
buyuk_harf_topam_sayı <- 594064
kucuk_harf_toplam_sayı <- 9346518
toplam_char_sayı <- buyuk_harf_topam_sayı+kucuk_harf_toplam_sayı
```





Pie chart plotting code.
```
# Totals for the two pie slices (lowercase first, then uppercase)
harf_degerleri_total <- sum(harf_degerleri)
upper_chars_total <- sum(upper_chars)
pie_chart_values <- c(harf_degerleri_total, upper_chars_total)
colors <- c("#009E73","#E69F00")
# Compute the percentages first; the labels depend on them.
# The *_sayı totals come from the bar chart section above.
kucuk_harf_yuzde <- round((100*kucuk_harf_toplam_sayı)/toplam_char_sayı)
buyuk_harf_yuzde <- round((100*buyuk_harf_topam_sayı)/toplam_char_sayı)
label1 <- paste(kucuk_harf_yuzde,"% lower chars.")
label2 <- paste(buyuk_harf_yuzde,"% upper chars.")
labelss <- c(label1,label2)
# Draw the pie chart last, once labelss is defined.
pie(pie_chart_values,main="upper-lower chars.", col=colors, labels=labelss)
```


    

<br /><hr/><em>Posted on <a href="https://utopian.io/utopian-io/@avnigenc/2-million-password-analysis-and-visualization">Utopian.io -  Rewarding Open Source Contributors</a></em><hr/>