Table Web Scraping Issues With Python
Solution 1:
You can use webdriver, pandas and BeautifulSoup to get all the table data.
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

url = "https://fantasy.premierleague.com/player-list"

# Selenium loads the page and executes its JavaScript
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
driver.quit()

# parse the rendered HTML and pull out the player table(s)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find_all('table', {'class': 'Table-ziussd-1 fVnGhl'})
df = pd.read_html(str(table))
print(df)
Output will be:
[         Player            Team  Points  Cost
0        Alisson       Liverpool      99  £6.2
1        Ederson        Man City      89  £6.0
2           Kepa         Chelsea      72  £5.4
3     Schmeichel       Leicester     122  £5.4
4         de Gea         Man Utd     105  £5.3
5         Lloris           Spurs      56  £5.3
6      Henderson   Sheffield Utd     135  £5.3
7       Pickford         Everton      93  £5.2
8       Patrício          Wolves     122  £5.2
9       Dubravka       Newcastle     124  £5.1
10          Leno         Arsenal     114  £5.0
11        Guaita  Crystal Palace     122  £5.0
12          Pope         Burnley     129  £4.9
13        Foster         Watford     113  £4.9
14     Fabianski        West Ham      61  £4.9
15     Caballero         Chelsea       7  £4.8
16          Ryan        Brighton     105  £4.7
17         Bravo        Man City      11  £4.7
18         Grant         Man Utd       0  £4.7
19        Romero         Man Utd       0  £4.6
20          Krul         Norwich      94  £4.6
21      Mignolet       Liverpool       0  £4.5
22      McCarthy     Southampton      74  £4.5
23      Ramsdale     Bournemouth      97  £4.5
24      Fahrmann         Norwich       1  £4.4
and so on ...]
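Note that pd.read_html returns a list of DataFrames, one per <table> it finds. A possible follow-up sketch (assuming the soup and table variables from the script above, and that the page renders one table per position) is to combine them into a single DataFrame:
# sketch: concatenate all matched tables into one DataFrame
dfs = pd.read_html(str(table))            # list of DataFrames
all_players = pd.concat(dfs, ignore_index=True)
print(all_players.head())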
Solution 2:
The table you want to scrape is generated by JavaScript, which is not executed when you do html = urlopen(url), so the table is not in the soup either.
There are many methods for getting dynamically generated data. Check here for an example.
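A minimal sketch to confirm the problem (assuming the same URL as in the question): the static HTML returned by the server contains no table element, so BeautifulSoup has nothing to find.
from urllib.request import urlopen
from bs4 import BeautifulSoup

# the server response is the page *before* JavaScript runs,
# so the player table is not part of this HTML yet
url = 'https://fantasy.premierleague.com/player-list'
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
print(soup.find('table'))  # likely None - the table is built client-side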
Solution 3:
https://fantasy.premierleague.com/player-list uses JavaScript to render its data into the HTML. BeautifulSoup cannot execute JavaScript, so we need to emulate a real browser to load the data. To do this you can use Selenium. The code below uses Firefox, but you could use Chrome, for example; check Selenium's documentation on how to get it running.
The script opens a Firefox browser, pauses for 1 second (to make sure that all JavaScript-generated data has loaded) and passes the HTML to BeautifulSoup. You might need to pip install lxml for the parser used in the script.
Then we look for all div elements with class 'Layout__Main-eg6k6r-1 cSyfD', as those contain all 4 tables on the website. You may want to use the Inspect Element tool in your browser to check the names of the tables and divs to target your search. Then you can take any of the 4 divs and search for tr elements in each.
from selenium import webdriver
import time
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.set_window_size(700, 900)

url = 'https://fantasy.premierleague.com/player-list'
browser.get(url)
time.sleep(1)  # give the JavaScript time to build the tables

# grab the fully rendered HTML and parse it
html = browser.execute_script('return document.documentElement.outerHTML')
all_html = BeautifulSoup(html, 'lxml')

# each of these divs wraps one of the 4 position tables
all_tables = all_html.find_all('div', {'class': 'Layout__Main-eg6k6r-1 cSyfD'})
print('Found ' + str(len(all_tables)) + ' tables')

table1_goalkeepers = all_tables[0]
rows_goalkeeper = table1_goalkeepers.tbody
print('Goalkeepers: \n')
print(rows_goalkeeper)

table2_defenders = all_tables[1]
print('Defenders \n')
rows_defenders = table2_defenders.tbody
print(rows_defenders)

browser.quit()
Sample output:
Goalkeepers:
<tbody><tr><td>Alisson</td><td>Liverpool</td><td>99</td><td>£6.2</td></tr><tr><td>Ederson</td><td>Man City</td><td>88</td><td>£6.0</td></tr><tr><td>Kepa</td><td>Chelsea</td><td>72</td><td>£5.4</td></tr><tr><td>Schmeichel</td><td>Leicester</td><td>122</td><td>£5.4</td></tr><tr><td>de Gea</td><td>Man Utd</td><td>105</td><td>£5.3</td></tr><tr><td>Lloris</td><td>Spurs</td><td>56</td><td>£5.3</td></tr><tr><td>Henderson</td><td>Sheffield Utd</td><td>135</td><td>£5.3</td></tr><tr><td>Pickford</td><td>Everton</td><td>93</td><td>£5.2</td></tr><tr><td>Patrício</td><td>Wolves</td><td>122</td><td>£5.2</td></tr><tr><td>Dubravka</td><td>Newcastle</td><td>124</td><td>£5.1</td></tr><tr><td>Leno</td><td>Arsenal</td><td>114</td><td>£5.0</td></tr><tr><td>Guaita</td><td>Crystal Palace</td><td>122</td><td>£5.0</td></tr><tr><td>Pope</td><td>Burnley</td><td>128</td><td>£4.9</td></tr><tr><td>Foster</td><td>Watford</td><td>113</td><td>£4.9</td></tr><tr><td>Fabianski</td><td>West Ham</td><td>61</td><td>£4.9</td></tr><tr><td>Caballero</td><td>Chelsea</td><td>7</td><td>£4.8</td></tr><tr><td>Ryan</td><td>Brighton</td><td>105</td><td>£4.7</td></tr><tr><td>Bravo</td><td>Man City</td><td>11</td><td>£4.7</td></tr><tr><td>Grant</td><td>Man Utd</td><td>0</td><td>£4.7</td></tr><tr><td>Romero</td><td>Man Utd</td><td>0</td><td>£4.6</td></tr><tr><td>Krul</td><td>Norwich</td><td>94</td><td>£4.6</td></tr><tr><td>Mignolet</td><td>Liverpool</td><td>0</td><td>£4.5</td></tr><tr><td>McCarthy</td><td>Southampton</td><td>74</td><td>£4.5</td></tr><tr><td>Ramsdale</td><td>Bournemouth</td><td>97</td><td>£4.5</td></tr><tr><td>Fahrmann</td><td>Norwich</td><td>1</td><td>£4.4</td></tr><tr><td>Roberto</td><td>West Ham</td><td>18</td><td>£4.4</td></tr><tr><td>Verrips</td><td>Sheffield Utd</td><td>0</td><td>£4.4</td></tr><tr><td>Kelleher</td><td>Liverpool</td><td>0</td><td>£4.4</td></tr><tr><td>Reina</td><td>Aston Villa</td><td>19</td><td>£4.4</td></tr><tr><td>Nyland</td><td>Aston Villa</td><td>11</td><td>£4.3</td></tr><tr><td>Heaton</td><td>Aston Villa</td><td>59</td><td>£4.3</td></tr><tr><td>Darlow</td><td>Newcastle</td><td>0</td><td>£4.3</td></tr><tr><td>Eastwood</td><td>Sheffield Utd</td><td>0</td><td>£4.3</td></tr><tr><td>Steer</td><td>Aston Villa</td><td>1</td><td>£4.3</td></tr><tr><td>Moore</td><td>Sheffield Utd</td><td>1</td><td>£4.3</td></tr><tr><td>Peacock-Farrell</td><td>Burnley</td><td>0</td><td>£4.3</td></tr></tbody>
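A possible follow-up sketch (assuming the rows_goalkeeper variable from the script above): pull the cell values out of the tbody into plain Python lists for further processing.
# sketch: turn each <tr> in the goalkeepers <tbody> into a list of strings
goalkeepers = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in rows_goalkeeper.find_all('tr')
]
print(goalkeepers[0])  # e.g. ['Alisson', 'Liverpool', '99', '£6.2']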
Solution 4:
This page uses JavaScript to add the data, but BeautifulSoup can't run JavaScript.
You can use Selenium to control a web browser, which can run JavaScript.
Or you can check in DevTools in Firefox/Chrome (tab: Network) which URL JavaScript uses to get the data from the server, and request that URL directly with urllib.
I chose the second method (manually searching in DevTools).
I found that JavaScript gets the data in JSON format from
https://fantasy.premierleague.com/api/bootstrap-static/
Because the data comes as JSON, I can convert it to Python lists/dictionaries with the json module and I don't need BeautifulSoup at all.
It takes more manual work to figure out the structure of the data, but it gives more data than the table on the page.
Here is all the data about the first player on the list, Alisson:
chance_of_playing_next_round = 100
chance_of_playing_this_round = 100
code = 116535
cost_change_event = 0
cost_change_event_fall = 0
cost_change_start = 2
cost_change_start_fall = -2
dreamteam_count = 1
element_type = 1
ep_next = 11.0
ep_this = 11.0
event_points = 10
first_name = Alisson
form = 10.0
id = 189
in_dreamteam = False
news =
news_added = 2020-03-06T14:00:17.901193Z
now_cost = 62
photo = 116535.jpg
points_per_game = 4.7
second_name = Ramses Becker
selected_by_percent = 9.2
special = False
squad_number = None
status = a
team = 10
team_code = 14
total_points = 99
transfers_in = 767780
transfers_in_event = 9339
transfers_out = 2033680
transfers_out_event = 2757
value_form = 1.6
value_season = 16.0
web_name = Alisson
minutes = 1823
goals_scored = 0
assists = 1
clean_sheets = 11
goals_conceded = 12
own_goals = 0
penalties_saved = 0
penalties_missed = 0
yellow_cards = 0
red_cards = 1
saves = 48
bonus = 9
bps = 439
influence = 406.2
creativity = 10.0
threat = 0.0
ict_index = 41.7
influence_rank = 135
influence_rank_type = 18
creativity_rank = 411
creativity_rank_type = 8
threat_rank = 630
threat_rank_type = 71
ict_index_rank = 294
ict_index_rank_type = 18
There is also information about teams, etc.
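For orientation, here is a minimal sketch that just lists the top-level sections of the JSON (assuming the same endpoint as in the code below):
from urllib.request import urlopen
import json

url = 'https://fantasy.premierleague.com/api/bootstrap-static/'
data = json.loads(urlopen(url).read().decode())

# top-level sections of the response,
# e.g. 'elements' (players), 'teams', 'element_types' (positions), ...
print(list(data.keys()))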
Code:
from urllib.request import urlopen
import json

#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

text = urlopen(url).read().decode()
data = json.loads(text)

print('\n--- element type ---\n')

#print(data['element_types'][0])
for item in data['element_types']:
    print(item['id'], item['plural_name'])

print('\n--- Goalkeepers ---\n')

number = 0
for item in data['elements']:
    if item['element_type'] == 1:  # Goalkeepers
        number += 1
        print('---', number, '---')
        print('type :', data['element_types'][item['element_type']-1]['plural_name'])
        print('first_name :', item['first_name'])
        print('second_name :', item['second_name'])
        print('total_points:', item['total_points'])
        print('team :', data['teams'][item['team']-1]['name'])
        print('cost :', item['now_cost']/10)
        if item['first_name'] == 'Alisson':
            for key, value in item.items():
                print('   ', key, '=', value)
Result:
--- element type ---

1 Goalkeepers
2 Defenders
3 Midfielders
4 Forwards

--- Goalkeepers ---

--- 1 ---
type : Goalkeepers
first_name : Bernd
second_name : Leno
total_points: 114
team : Arsenal
cost : 5.0
--- 2 ---
type : Goalkeepers
first_name : Emiliano
second_name : Martínez
total_points: 1
team : Arsenal
cost : 4.2
--- 3 ---
type : Goalkeepers
first_name : Ørjan
second_name : Nyland
total_points: 11
team : Aston Villa
cost : 4.3
--- 4 ---
type : Goalkeepers
first_name : Tom
second_name : Heaton
total_points: 59
team : Aston Villa
cost : 4.3
The code gives the data in a different order than the table on the page, but if you put it all in a list, or better in a pandas DataFrame, you can sort it in any order you like.
EDIT:
You can use pandas to get the data from the JSON:
from urllib.request import urlopen
import json
import pandas as pd

#url = 'https://fantasy.premierleague.com/player-list'
url = 'https://fantasy.premierleague.com/api/bootstrap-static/'

# read data from url and convert to Python's list/dictionary
text = urlopen(url).read().decode()
data = json.loads(text)

# create DataFrames
players = pd.DataFrame.from_dict(data['elements'])
teams = pd.DataFrame.from_dict(data['teams'])

# divide by 10 to get `6.2` instead of `62`
players['now_cost'] = players['now_cost'] / 10

# convert team's number to its name
players['team'] = players['team'].apply(lambda x: teams.iloc[x-1]['name'])

# filter players
goalkeepers = players[ players['element_type'] == 1 ]
defenders = players[ players['element_type'] == 2 ]
# etc.

# some information
print('\n--- goalkeepers columns ---\n')
print(goalkeepers.columns)

print('\n--- goalkeepers sorted by name ---\n')
sorted_data = goalkeepers.sort_values(['first_name'])
print(sorted_data[['first_name', 'team', 'now_cost']].head())

print('\n--- goalkeepers sorted by cost ---\n')
sorted_data = goalkeepers.sort_values(['now_cost'], ascending=False)
print(sorted_data[['first_name', 'team', 'now_cost']].head())

print('\n--- teams columns ---\n')
print(teams.columns)

print('\n--- teams ---\n')
print(teams['name'].head())
# etc.
Results
--- goalkeepers columns ---
Index(['chance_of_playing_next_round', 'chance_of_playing_this_round', 'code',
'cost_change_event', 'cost_change_event_fall', 'cost_change_start',
'cost_change_start_fall', 'dreamteam_count', 'element_type', 'ep_next',
'ep_this', 'event_points', 'first_name', 'form', 'id', 'in_dreamteam',
'news', 'news_added', 'now_cost', 'photo', 'points_per_game',
'second_name', 'selected_by_percent', 'special', 'squad_number',
'status', 'team', 'team_code', 'total_points', 'transfers_in',
'transfers_in_event', 'transfers_out', 'transfers_out_event',
'value_form', 'value_season', 'web_name', 'minutes', 'goals_scored',
'assists', 'clean_sheets', 'goals_conceded', 'own_goals',
'penalties_saved', 'penalties_missed', 'yellow_cards', 'red_cards',
'saves', 'bonus', 'bps', 'influence', 'creativity', 'threat',
'ict_index', 'influence_rank', 'influence_rank_type', 'creativity_rank',
'creativity_rank_type', 'threat_rank', 'threat_rank_type',
'ict_index_rank', 'ict_index_rank_type'],
dtype='object')
--- goalkeepers sorted by name ---

    first_name         team  now_cost
94       Aaron  Bournemouth       4.5
305     Adrián    Liverpool       4.0
485       Alex  Southampton       4.5
533      Alfie        Spurs       4.0
291    Alisson    Liverpool       6.2

--- goalkeepers sorted by cost ---

    first_name       team  now_cost
291    Alisson  Liverpool       6.2
323    Ederson   Man City       6.0
263     Kasper  Leicester       5.4
169       Kepa    Chelsea       5.4
515       Hugo      Spurs       5.3

--- teams columns ---
Index(['code', 'draw', 'form', 'id', 'loss', 'name', 'played', 'points',
'position', 'short_name', 'strength', 'team_division', 'unavailable',
'win', 'strength_overall_home', 'strength_overall_away',
'strength_attack_home', 'strength_attack_away', 'strength_defence_home',
'strength_defence_away', 'pulse_id'],
dtype='object')
--- teams ---

0        Arsenal
1    Aston Villa
2    Bournemouth
3       Brighton
4        Burnley
Name: name, dtype: object
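A closing sketch (assuming the players DataFrame built in the EDIT above): to get roughly the same ordering as the player-list page, sort by position and then by cost, descending.
# sketch: order by position, then most expensive first - similar to the site's list
ordered = players.sort_values(['element_type', 'now_cost'],
                              ascending=[True, False])
print(ordered[['web_name', 'team', 'total_points', 'now_cost']].head(10))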