# -*- coding: utf-8 -*-
"""Google Play Store app review
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1PFhjGkIiU9roRPH36s5iSp4W9Ppn4LoO
**EDA on Play Store app reviews**
**by Magesh Babu**
**BUSINESS CONTEXT**
The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps. Explore and analyse the data to discover the key factors responsible for app engagement and success.
**Problem statement**
* The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market.
* Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.
* Explore and analyse the data to discover the key factors responsible for app engagement and success.
**Importing important packages**
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# importing the datetime library
from datetime import datetime
# ignoring warnings
import warnings
warnings.filterwarnings('ignore')
"""**exploring play store data**
**loading the data set**
"""
df = pd.read_csv('/Play Store Data.csv')
df.head()
"""**Data Description**
1. App: contains the name of the app with a short description (optional).
2. Category: gives the category of the app.
3. Rating: contains the average rating the respective app received from its users.
4. Reviews: contains the number of users that have dropped a review for the respective app.
5. Size: contains the disk space required to install the respective app.
6. Installs: gives the rounded figure of the number of times the respective app was downloaded.
7. Type: states whether an app is free to use or paid.
8. Price: gives the price payable to install the app. For free apps, the price is zero.
9. Content Rating: states whether or not an app is suitable for all age groups.
10. Genres: gives the genre(s) to which the respective app belongs.
11. Last Updated: gives the day on which the latest update was released.
12. Current Ver: gives the current version of the respective app.
13. Android Ver: gives the Android version required by the respective app.
"""
#getting the information
df.info()
#getting the shape
df.shape
#finding the duplicate value
dup = df.duplicated().value_counts()
dup
#visualising through bar graph
plt.figure(figsize= (10,6))
dup.plot(kind='bar',color=['g','r'])
plt.xticks(rotation=360)
plt.title("duplicate value")
#dropping the duplicate values
df= df.drop_duplicates()
df.duplicated().value_counts()
"""**Find the null values**"""
df.isnull().sum()
#visualizing null values through heatmap.
plt.figure(figsize=(25,10))
sns.heatmap(df.isnull(),cbar= False,yticklabels=False,cmap='viridis')
plt.xlabel("name of columns")
plt.title("place of missing values in column")
#finding the unique values
print(df.apply(lambda col: col.unique()))
df['Type'].value_counts()
df['Type'].unique()
df[df['Type'].isnull()]
"""**Since the NaN value in Type belongs to a row with price 0, it should be of type Free**"""
# fill the missing Type with 'Free', the label used for free apps in this column
df['Type'] = df['Type'].fillna('Free')
# hence the null value has been replaced
df[df['Type'].isnull()]
"""**treating null values rating column**"""
#how many null value are there
df["Rating"].isnull().sum()
#lets find the mean and median of it
mean_rating = df['Rating'].mean()
median_rating = df['Rating'].median()
round(mean_rating,1),round(median_rating,2)
# let's check the box plot for outliers
plt.figure(figsize=(15,6))
sns.boxplot(x=df['Rating'])
"""**Since there are a lot of outliers, and the mean is affected by outliers while the median is not, we will replace the null values with the median**"""
df['Rating'] = df['Rating'].fillna(df['Rating'].median())
# checking for null values now
df['Rating'].isnull().sum()
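"""**Sketch: why the median is used for imputation**
A minimal illustration, using made-up ratings rather than the Play Store data, of how a single extreme value shifts the mean while the median stays put. The list `ratings` and the value 19.0 standing in for a bad entry are assumptions for this example only.

```python
from statistics import mean, median

# Hypothetical ratings; 19.0 plays the role of a corrupted entry
ratings = [4.1, 4.3, 4.4, 4.5, 19.0]

mean_rating = mean(ratings)      # pulled upward by the extreme value
median_rating = median(ratings)  # middle value of the sorted list

print(round(mean_rating, 2), median_rating)
```

Filling missing ratings with the median therefore keeps the imputed value close to where most ratings lie.
"""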
"""**Hence all the null values are replaced with median and now lets take care of outliers**"""
# Listing the 5 largest values
sorted(df['Rating'])[-5 :]
df[df['Rating'] == 19.0]
"""A rating cannot be 19, and a category cannot be 1.9.
The entire row is shifted because its Category value is missing, so it is better to drop the entire row.
"""
#checking the shape before dropping
df.shape
#dropping the row number 10472
df=df.drop(10472)
#check the shape after dropping
df.shape
# let's check the boxplot for outliers
plt.figure(figsize = (15,6))
sns.boxplot(x=df['Rating'],color ='orange')
plt.title("outliers")
plt.grid()
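"""**Sketch: how a boxplot decides what is an outlier**
The boxplot above marks points as outliers when they lie more than 1.5 times the interquartile range (IQR) outside the quartiles. The following is a rough sketch on made-up ratings (the array `ratings` is hypothetical, not taken from the dataset) that computes those fences by hand and also checks whether every value is still a valid rating between 1 and 5.

```python
import numpy as np

# Hypothetical ratings, all within the valid 1-5 scale
ratings = np.array([1.0, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 5.0])

q1, q3 = np.percentile(ratings, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points the 1.5 * IQR rule would flag
flagged = ratings[(ratings < lower_fence) | (ratings > upper_fence)]
# Domain check: is every rating still on the 1-5 scale?
in_valid_range = ((ratings >= 1) & (ratings <= 5)).all()

print(flagged, in_valid_range)
```

A value can be flagged by the rule and still be a legitimate rating, which is why a domain check is a useful complement to the statistical one.
"""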
"""**Observation
By the boxplot's 1.5 × IQR formula some values count as outliers, but ratings normally range from 1 to 5 and no value falls outside that range, so the outliers are not dropped.
Checking for null values in Current Ver and Android Ver
**
"""
df.isnull().sum()
df['Current Ver'].unique()
df
"""**Since there are only 8 null values in Current Ver and 2 in Android Ver, we can either replace them or drop them. Let's replace them with 'Varies with device'**"""
df['Current Ver'] = df['Current Ver'].fillna('Varies with device')
df['Android Ver'] = df['Android Ver'].fillna('Varies with device')
df.isnull().sum()
"""**Let's change the date time format**"""
# The datetime.strptime function is applied to the values in the Last Updated column to convert them from string to datetime
from datetime import datetime
df['Last Updated'] = df['Last Updated'].apply(lambda x: datetime.strptime(x, '%B %d, %Y'))
df.head()
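"""**Sketch: a vectorised alternative for date parsing**
The conversion above parses one string at a time. As an alternative sketch, `pd.to_datetime` can parse a whole column with the same format string. The sample values below are made up, and `errors='coerce'` is an optional choice made here so that unparsable entries become NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical values in the same 'Month day, Year' layout
dates = pd.Series(['January 7, 2018', 'August 1, 2018', 'not a date'])

# Same format string as the strptime call; bad entries become NaT
parsed = pd.to_datetime(dates, format='%B %d, %Y', errors='coerce')

print(parsed)
```

Counting the NaT values afterwards is a quick way to see how many rows did not match the expected format.
"""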
df.info()
"""**The column Installs contains unnecessary characters like comma (,) and plus (+) which have to be removed.**"""
df['Installs'].value_counts()
# regex=True is required so that "[+,]" is treated as a character class rather than literal text
df['Installs'] = df['Installs'].str.replace(r"[+,]", '', regex=True)
df['Installs'].value_counts()
# Changing the datatype of Installs from object to int
df['Installs'] = df['Installs'].astype(int)
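"""**Sketch: stripping '+' and ',' from install counts**
A small sketch of the cleaning step on made-up values. The pattern `[+,]` is a regular-expression character class, so `regex=True` is passed explicitly here; the Series `installs` is hypothetical.

```python
import pandas as pd

# Hypothetical install labels in the same style as the Installs column
installs = pd.Series(['10,000+', '1,000,000+', '500+'])

# Remove every '+' and ',' and convert the remaining digits to integers
cleaned = installs.str.replace(r'[+,]', '', regex=True).astype(int)

print(cleaned.tolist())
```

If the pattern were matched as literal text instead, nothing would be removed and the integer conversion would fail on the leftover characters.
"""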
df.info()
"""**Defining a function to convert all the entries in KB to MB and then converting them to float datatype**
The values in the Size column use different units: 'M' stands for MB and 'k' stands for KB. To analyse this column easily, all values must be converted to a single unit. In this case, we convert everything to MB.
Since 1 MB = 1024 KB, values in KB are divided by 1024 to convert them to MB.
"""
def kb_to_mb(val):
    # Convert sizes such as '19M' or '201k' to a float in MB; anything else is returned unchanged
    try:
        if 'M' in val:
            return float(val[:-1])
        elif 'k' in val:
            return round(float(val[:-1])/1024, 2)
        else:
            return val
    except (TypeError, ValueError):
        # non-string or unparsable entries are left as they are
        return val
# The kb_to_mb function applied to the Size column
df['Size'] = df['Size'].apply(kb_to_mb)
df.head()
df['Size'].value_counts()
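"""**Sketch: the KB to MB conversion on a few values**
A worked example of the conversion described above. `size_to_mb` is a hypothetical variant of `kb_to_mb` that returns NaN for entries without a unit suffix, which is an assumption made here so that the result is purely numeric; the sample strings are made up.

```python
import math

def size_to_mb(val):
    # Hypothetical variant: NaN instead of the original string for non-sizes
    if isinstance(val, str) and val.endswith('M'):
        return float(val[:-1])                    # already in MB
    if isinstance(val, str) and val.endswith('k'):
        return round(float(val[:-1]) / 1024, 2)   # KB -> MB
    return math.nan

examples = ['19M', '512k', 'Varies with device']
converted = [size_to_mb(v) for v in examples]

print(converted)
```

Returning NaN rather than the original text is one possible design choice; it lets the column be stored as float at the cost of losing the 'Varies with device' label.
"""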
# Plotting the boxplot for the Size column except 'Varies with device'
size_new = df[df['Size'] != 'Varies with device']['Size'].astype(float)
plt.figure(figsize = (15,6))
sns.boxplot(x = size_new, color = 'orange')
plt.title("Size Outliers")
plt.grid()
"""
**There are outliers, but we cannot remove them: they are genuine app sizes, which can be as high as 100 MB and as low as 1 MB**"""
df.info()
"""**The Price column contains a dollar sign, which has to be removed because the values cannot be converted to numbers while it is present.**
**Also changing the type of the Price column from object to float**
"""
df['Price'].unique()
# regex=False treats '$' as a literal character rather than a regex anchor
df['Price'] = df['Price'].str.replace('$', '', regex=False)
df['Price'].unique()
df['Price'] = df['Price'].astype(float)
"""**Also changing the datatype of Reviews to float**"""
df['Reviews'] = df['Reviews'].astype(float)
df.info()
"""**Describing the Play Store columns**"""
df.describe().style.background_gradient()
"""**Correlation**"""
plt.figure(figsize=(15,10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
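# correlate only the numeric columns; text columns cannot be correlated
sns.heatmap(np.round(df.select_dtypes(include='number').corr(),2),annot=True, cmap=cmap)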
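"""**Sketch: correlation on numeric columns only**
Correlation is only defined for numeric columns, so text columns such as the app name have to be left out. One way to do this, shown here on a made-up frame (`apps` and its values are hypothetical), is to select the numeric columns before calling `corr`.

```python
import pandas as pd

# Hypothetical mini dataset with one text column and three numeric ones
apps = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Rating': [4.0, 4.5, 3.5, 4.2],
    'Reviews': [10.0, 20.0, 30.0, 40.0],
    'Installs': [100, 200, 300, 400],
})

# Keep only numeric columns, then compute pairwise Pearson correlation
corr = apps.select_dtypes(include='number').corr()

print(corr.round(2))
```

In this toy frame Installs is exactly ten times Reviews, so that pair is perfectly correlated.
"""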
"""**Exploratory** **Data** **Analysis**
**Univariate Analysis**
1. **Which Category is most preferred by people?**
"""
df.head()
df['Category'].value_counts()
plt.figure(figsize = (10,12))
df['Category'].value_counts().plot(kind = 'barh', color = 'g').invert_yaxis()
plt.title('Most preferred category')
"""**Observation**
Family is the category with the most apps in the store.
The second largest category is Game.
**2. What are the overall ratings of the apps?**
"""
df['Rating'].value_counts()
fig, ax = plt.subplots(figsize =(10, 7))
ax.hist(df['Rating'], bins = [1, 2, 3, 4, 5], color = 'g')
plt.title("Total Rating ")
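"""**Sketch: share of apps in a rating band**
The histogram above groups ratings into unit-wide bins. As a complement, the share of apps inside one band can be computed directly. The ratings below are made up, so the resulting percentage says nothing about the real data.

```python
import pandas as pd

# Hypothetical ratings
ratings = pd.Series([4.5, 4.1, 3.2, 4.8, 2.5])

# between() is inclusive on both ends; the mean of booleans is a proportion
share_4_to_5 = ratings.between(4, 5).mean()

print(f'{share_4_to_5:.0%}')
```

Applied to the Rating column, the same expression would give the fraction of apps rated between 4 and 5.
"""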
"""**Observation**
About 80% of the apps in the Play Store have a rating between 4 and 5.
**3. How many installations happened?**
"""
df.head()
df['Installs'].value_counts().reset_index()
plt.figure(figsize = (18,8))
# count the apps per install bucket once and plot the counts directly
install_counts = df['Installs'].value_counts()
sns.barplot(x = install_counts.index, y = install_counts.values)
plt.xticks(rotation = 45)
plt.title("Install Counts")
plt.xlabel("Installs");
"""**Observations**
There are 1488 apps with more than 10,00,000 downloads/installs.
Almost the same number of apps have 1,00,00 and 100,00,000 downloads/installs.
**4. Find the top free apps**
"""
df.head()
# Filtering out free apps
free_apps = df[df['Type'] == 'Free']
free_apps['Type'].value_counts()
# Keeping only the free apps with the maximum number of installs
top_free_apps = free_apps[free_apps['Installs'] == free_apps['Installs'].max()]
top_free_apps.head()
top_free_apps.shape
top_free_apps['Category'].value_counts()
# Visualizing using barplot
top_free_counts = top_free_apps['Category'].value_counts()
plt.figure(figsize = (18,8))
sns.barplot(x = top_free_counts.index, y = top_free_counts.values)
plt.xticks(rotation = 45)
plt.title("Top Free App category")
plt.xlabel("Category")
plt.ylabel("Count")
plt.show()
"""**Observation**
Communication is the category that draws the most interest among the top free apps,
followed by the Social category.
**5. Find the top paid apps**
"""
df['Type'].unique()
# Filtering out paid apps (copy so that later changes do not touch df)
paid_apps = df[df['Type'] == 'Paid'].copy()
paid_apps['Type'].value_counts()
# Sorting by price, highest first
paid_apps = paid_apps.sort_values("Price", ascending = False, na_position = "first")
paid_apps
price_counts = paid_apps['Price'].value_counts()
price_counts
plt.figure(figsize = (10,20))
sns.barplot(x = price_counts.values, y = price_counts.index, orient = 'h')
plt.title("Paid apps count")
plt.xlabel("Count")
plt.ylabel("Price in Dollar")
plt.show()
"""**Observation**
Paid apps charge users a certain amount to download and install the app. This amount varies from one app to another.
Many apps charge a small amount, whereas some charge a much larger one. In this case the price to download an app ranges from USD 0.99 to USD 400.
To select the top paid apps, it would not be fair to look only at the number of installs, because apps with a lower installation fee will generally be installed by more people.
A better way to determine the top apps in the paid category is to estimate the revenue each generated through app installs.
This is given by:
Revenue = Price * Installs
6. **Content Rating**
"""
df.head()
df['Content Rating'].value_counts()
# Visualizing with the graph
content_counts = df['Content Rating'].value_counts()
plt.figure(figsize = (15,6))
sns.barplot(x = content_counts.index, y = content_counts.values)
plt.title("Content Rating")
plt.xlabel("Content Rating")
plt.ylabel("Count")
"""**Observation**
Most of the apps are rated for Everyone, and their main source of income is presumably ads, although this dataset cannot confirm that.
7. **Genres**
"""
df['Genres'].value_counts().iloc[:15]
# Visualizing using pie chart.
textprops = {"fontsize":15} # Font size of text in pie chart
plt.figure(figsize = (9,9)) # fixing pie chart size
df['Genres'].value_counts().iloc[:15].plot(kind = 'pie', shadow = True, autopct='%1.1f%%', textprops =textprops)
plt.title("Genres")
"""**Observation**
Tools is the most common genre; the other genres in the top 15 have roughly equal shares.
**Bivariate Analysis**
**1. Find the top profitable app in terms of revenue**
"""
paid_apps.head()
# Creating a column called revenue
paid_apps['Revenue'] = paid_apps['Price'] * paid_apps['Installs']
paid_apps.head()
# Sorting the Revenue column in descending order
top_paid_apps = paid_apps.sort_values("Revenue", ascending = False)
top_paid_apps.head()
plt.figure(figsize = (10,12))
sns.barplot(data = top_paid_apps.head(20), x = 'Revenue', y = 'App')
plt.title("Top 20 most profitable apps")
plt.show()
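"""**Sketch: ranking paid apps by estimated revenue**
A toy version of the revenue ranking above. The app names, prices and install counts are invented for illustration; the estimate is simply price multiplied by installs, as in the Revenue column.

```python
import pandas as pd

# Hypothetical paid apps
paid = pd.DataFrame({
    'App': ['App A', 'App B', 'App C'],
    'Price': [6.99, 399.99, 0.99],
    'Installs': [10_000_000, 10_000, 1_000_000],
})

# Estimated revenue from installs alone
paid['Revenue'] = paid['Price'] * paid['Installs']
ranked = paid.sort_values('Revenue', ascending=False)

print(ranked[['App', 'Revenue']])
```

The toy numbers show why price alone is misleading: the most expensive app is not the one with the highest estimated revenue. Since Installs is a rounded figure, the result is an estimate rather than an exact amount.
"""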
"""**Observation**
Minecraft is the most profitable paid application, followed by I'm rich.
**2. Which categories do the top paid apps belong to?**
"""
plt.figure(figsize = (15,6))
# barplot shows the mean Price per Category
sns.barplot(data = paid_apps, x = "Category", y = "Price")
plt.xticks(rotation = 90)
plt.title("Categories of the highest-priced paid apps")
plt.show()
"""**Observation**
Finance is the category with the highest average price among paid apps. The plot shows price, not revenue.
# **Exploring User Review data**
**Loading the data set**
"""
import pandas as pd
df = pd.read_csv('/User Reviews.csv')
df.head()
# Checking Shape
df.shape
#checking info
df.info()
# Summary statistics for the numerical columns
df.describe().style.background_gradient()
# Finding the duplicated value
dup = df.duplicated().value_counts()
dup
# Visualizing the duplicated value
plt.figure(figsize = (8,6))
dup.plot(kind = 'bar', color = ['r','g'])
plt.xticks(rotation = 360)
# Dropping the duplicated values
df = df.drop_duplicates()
df.duplicated().value_counts()
# Checking the shape after dropping the duplicated value
df.shape
# Checking for null value
df.isnull().sum()
# Visualizing null values through heatmap.
plt.figure(figsize=(25, 10))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False,cmap='viridis')
plt.xlabel("Name Of Columns")
plt.title("Places of missing values in column")
"""**There are a lot of NaN values and we cannot just drop them.**"""
df[df['Translated_Review'].isnull()]
"""**The rows that do not have a review (NaN value instead) tend to have NaN values in the columns Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity in the majority of cases.**"""
# The rows with NaN in the Translated_Review column where the rest of the columns are non-null.
df[df['Translated_Review'].isnull() & df['Sentiment'].notna()]
"""**In the few exceptional cases where the remaining columns are non-null although Translated_Review is null, the values appear to be errors, because the sentiment, sentiment polarity and sentiment subjectivity of a review can be determined only if there is a corresponding review.**
**Hence these values are wrong and can be deleted altogether.**
"""
# Dropping all the null values
df = df.dropna()
# Now checking the shape
df.shape
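"""**Sketch: finding rows with a sentiment but no review**
A small sketch of the consistency check and the drop, on a made-up frame. The frame `reviews` and its values are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical reviews: row 1 is fully empty, row 2 has a sentiment without a review
reviews = pd.DataFrame({
    'Translated_Review': ['Great app', None, None],
    'Sentiment': ['Positive', None, 'Negative'],
})

# Rows where a sentiment is recorded although the review text is missing
inconsistent = reviews[reviews['Translated_Review'].isnull() & reviews['Sentiment'].notna()]

# dropna() removes every row that has at least one missing value
cleaned = reviews.dropna()

print(len(inconsistent), len(cleaned))
```

Because `dropna()` drops a row as soon as any column is missing, it removes both the empty rows and the inconsistent ones in one step.
"""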
# Let's check for the null values
df.isnull().sum()
"""**Now that there are no null values, we can start analyzing the data**"""
df.head()
"""**Exploratory Data Analysis**
**Univariate Analysis**
**1. What are the sentiment types for the apps?**
"""
df['Sentiment'].value_counts()
plt.figure(figsize = (10,6))
# count the reviews per sentiment once and plot the counts directly
sentiment_counts = df['Sentiment'].value_counts()
sns.barplot(x = sentiment_counts.index, y = sentiment_counts.values)
plt.title("Sentiment")
plt.ylabel("Count")
plt.show()
"""**Observation**
**Looks like most of the apps have a positive response from users**
**2. Top apps with Sentiment**
"""
df.head()
app_sentiment = df.groupby(['App'])['Sentiment'].value_counts().iloc[:27]
app_sentiment
plt.figure(figsize = (6,10))
app_sentiment.plot(kind = 'barh')
"""**Observation**
**Among the apps listed here, 10 Best Foods for You has the highest number of positive reviews**
**3. Find the top 10 positive sentiment apps**
"""
df.head()
positive_sentiment = df[df['Sentiment'] == 'Positive']
positive_sentiment.head()
top_positive_sentiment = positive_sentiment.groupby('App')['Sentiment'].value_counts().nlargest(10)
top_positive_sentiment
plt.figure(figsize = (15,6))
top_positive_sentiment.plot(kind = 'bar', color = 'g')
plt.title("Top 10 positive sentiment apps")
# Visualizing using pie chart.
textprops = {"fontsize":15} # Font size of text in pie chart
plt.figure(figsize = (9,9)) # fixing pie chart size
top_positive_sentiment.plot(kind = 'pie', shadow = True, autopct='%1.1f%%', textprops =textprops)
plt.title("top_positive_sentiment")
top_positive_sentiment.keys()
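"""**Sketch: counting positive reviews per app**
A toy version of the top-N ranking above, using `groupby('App').size()` as an alternative way to count. The app names and sentiments are invented for illustration.

```python
import pandas as pd

# Hypothetical reviews: App A has 3 positives, App B has 2, App C has 1
reviews = pd.DataFrame({
    'App': ['App A', 'App A', 'App A', 'App B', 'App B', 'App B', 'App C'],
    'Sentiment': ['Positive', 'Positive', 'Positive',
                  'Positive', 'Positive', 'Negative', 'Positive'],
})

# Keep positive reviews, count them per app, then take the largest counts
positive = reviews[reviews['Sentiment'] == 'Positive']
top_positive = positive.groupby('App').size().nlargest(2)

print(top_positive)
```

Counting with `size()` yields an index of plain app names, which gives shorter plot labels than the (App, Sentiment) pairs produced by `value_counts()` on a grouped column.
"""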
"""# **Trying tree map for this**"""
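# !pip install squarify  # notebook shell command: run it in a Colab/Jupyter cell, or install squarify with pip from the command line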
import squarify
plt.figure(figsize = (20,10))
squarify.plot(sizes=top_positive_sentiment,alpha=0.8, label = top_positive_sentiment.keys(),
pad=1, text_kwargs={'fontsize': 12})
plt.axis("off")
plt.title("Top 10 positive sentiment apps")
"""**4. Find the top 10 negative sentiment apps**"""
negative_sentiment = df[df['Sentiment'] == 'Negative']
negative_sentiment.head()
top_negative_sentiment = negative_sentiment.groupby('App')['Sentiment'].value_counts().nlargest(10)
top_negative_sentiment
plt.figure(figsize = (15,6))
top_negative_sentiment.plot(kind = 'bar', color = 'y')
plt.title("Top 10 Negative sentiment apps")
# Visualizing using pie chart.
textprops = {"fontsize":15} # Font size of text in pie chart
plt.figure(figsize = (9,9)) # fixing pie chart size
top_negative_sentiment.plot(kind = 'pie', shadow = True, autopct='%1.1f%%', textprops =textprops)
plt.title("top_negative_sentiment")
plt.figure(figsize = (20,10))
# labels must come from the negative ranking, not the positive one
squarify.plot(sizes=top_negative_sentiment,alpha=0.8, label = top_negative_sentiment.keys(),
              pad=1, text_kwargs={'fontsize': 12})
plt.axis("off")
plt.title("Top 10 negative sentiment apps")
"""## **conclusion**
Percentage of free apps = ~92%
Percentage of apps with no age restrictions = ~82%
Most competitive category: Family
Category with the highest number of installs: Game
Category with the highest average app installs: Communication
Percentage of apps that are top rated = ~80%
There are 20 free apps that have been installed over a billion times.
Minecraft is the only app in the paid category with over 10M installs. This app has also produced the most revenue from the installation fee alone.
The median size of all apps in the Play Store is 12 MB.
The apps whose size varies with device have the highest average number of installs.
The apps whose size is greater than 90 MB have the highest average number of user reviews, i.e., they are more popular than the rest.
Helix Jump has the highest number of positive reviews and Angry Birds Classic has the highest number of negative reviews.
"""