Parsing an HTML table with Python

Environment

Operating System: Windows 10
Python Version: 3.5.4

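Source

This one is a programming exercise. It took me quite a while, on and off, to get it working; being a beginner is rough.
The original problem is here (http://www.codewars.com/kata/scraping-codewars-top-500-users/train/python)

The problem statement reads: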

Return a ‘Leaderboard’ object with a position property.
Leaderboard#position should contain 500 ‘User’ objects.
Leaderboard#position[i] should return the ith ranked User(1 based index).

Each User should have the following properties:

User#name    # => the user's username, not their real name
User#clan # => the user's clan, empty string if empty clan field
User#honor # => the user's honor points as an integer

Ex

an_alien = leaderboard.position[3]   # => #<User:0x3124da0 @name="myjinxin2015", @clan="China Changyuan", @honor=21446>
an_alien.name # => "myjinxin2015"
an_alien.clan # => "China Changyuan"
an_alien.honor # => 21446

Understanding the problem

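  • 1. Calling the function must return a Leaderboard object, and that object has a position property
  • 2. The position property is a list made up of user objects
  • 3. Each user object must have the three properties name, clan and honor; extra properties are fine, missing ones are not
  • 4. The user data in this list has to be scraped from the top-500 leaderboard page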
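
The requirements above can be sketched as plain data structures before any scraping happens. This is only an illustration of the required shape, not the solution used later in this post: `User`, `Board` and `make_board` are names made up here, the first sample user is invented, and keying a dict by rank is just one way to get 1-based lookup. Whether the kata's tests accept a dict has not been checked.

```python
from collections import namedtuple

# Hypothetical containers that only illustrate the required shape.
User = namedtuple("User", "name clan honor")
Board = namedtuple("Board", "position")


def make_board(users):
    """Key each user by rank so that position[1] is the top-ranked user."""
    return Board(position={rank: user for rank, user in enumerate(users, start=1)})


board = make_board([
    User("example_user", "", 30000),                 # made-up sample data
    User("myjinxin2015", "China Changyuan", 21446),  # values from the problem statement
])

print(board.position[2].name)
print(len(board.position))
```

With a dict, `len(position)` equals the number of users and there is no index 0 to worry about.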

Solution process

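  • 1. I actually found this problem quite interesting, since it amounts to a small web scraper. Open the page and press F12, and you see something like the following
    [Screenshot: the leaderboard table in the browser developer tools]
    The parts in the red boxes are what the problem asks us to scrape; well, plus two fields I added myself
  • 2. That raises two questions
    • i. How do we get the page content? Admittedly a silly question, since the problem itself tells us to use BeautifulSoup
    • ii. How do we build objects that satisfy the problem's requirements?
  • 3. Getting the page content
    • i. After opening the page, I found that everything to scrape sits inside a single <table> tag, so the task becomes parsing that table to get its contents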
# I took a shortcut here: the page contains only one table, so I simply look up every tr tag; the [1:] drops the header row
board = texts_soup.find_all('tr')[1:]
# But there are other ways to do it
# like this
board = texts_soup.find_all(class_='leaderboard')[0]
# or like this
board = texts_soup.find('div', attrs={'class': 'leaderboard'})

This pulls the table out of the page; next we need to extract the specific fields
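
To make the difference between the shortcut and the more targeted lookups concrete, here is a small sketch that runs against an inline HTML string instead of the live page. The markup in `SAMPLE_HTML` is invented for illustration and only mimics a leaderboard table; the real page may be structured differently. It also assumes the built-in `html.parser` backend rather than `lxml`, so nothing beyond `bs4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up markup that only mimics a leaderboard table; the real page may differ.
SAMPLE_HTML = """
<div class="leaderboard">
  <table>
    <tr><th>Position</th><th>User</th><th>Clan</th><th>Honor</th></tr>
    <tr data-username="alice"><td>#1</td><td>alice</td><td>Team A</td><td>300</td></tr>
    <tr data-username="bob"><td>#2</td><td>bob</td><td></td><td>200</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Shortcut: every tr in the whole document, minus the header row.
rows_shortcut = soup.find_all("tr")[1:]

# More targeted: find the container by class first, then take its rows.
container = soup.find("div", attrs={"class": "leaderboard"})
rows_targeted = container.find_all("tr")[1:]

print([row["data-username"] for row in rows_targeted])
```

Both lookups yield the same rows here because the sample has a single table; the targeted form keeps working if the page later gains a second one.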

# tr is one data row of the table
cells = tr.find_all('td')
profileLink = cells[1].contents[1].get('href')
rank = cells[1].contents[0].find("span").contents[0]
profileImg = cells[1].contents[1].img['src']
name = cells[1].contents[1].contents[1]
# clan stays an empty string when the cell has no content
clan = ""
if len(cells[2].contents) > 0:
    clan = cells[2].contents[0]
honor = cells[3].contents[0]

But this approach looks rather tedious. It can also be written like this

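for row in board.find_all('tr')[1:]:
    # the username is read from the row's data-username attribute
    name = row.attrs['data-username']
    # each of these is a td Tag; read its text with .get_text()
    rank, _, clan, honor = row.find_all('td')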
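
The shorter loop leaves `rank`, `clan` and `honor` as `td` tags, so they still have to be turned into plain values. A minimal sketch of that step follows; `parse_row` and `ROW_HTML` are names invented here, and the `#` prefix on the rank and the thousands separator in the honor value are assumptions about the cell text rather than something taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up row in the four-column layout that the unpacking above implies.
ROW_HTML = """
<table>
  <tr data-username="alice"><td>#1</td><td>alice</td><td></td><td>21,446</td></tr>
</table>
"""


def parse_row(row):
    """Return (name, rank, clan, honor) as plain Python values."""
    name = row.attrs["data-username"]
    rank_td, _, clan_td, honor_td = row.find_all("td")
    # Assumed formats: rank looks like "#1", honor may contain commas.
    rank = int(rank_td.get_text(strip=True).lstrip("#"))
    clan = clan_td.get_text(strip=True)  # empty string when the cell is empty
    honor = int(honor_td.get_text(strip=True).replace(",", ""))
    return name, rank, clan, honor


row = BeautifulSoup(ROW_HTML, "html.parser").find("tr")
print(parse_row(row))
```

Using `get_text(strip=True)` covers the empty-clan case without a separate length check.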

  • 2. Next we need to create an object and put the scraped data into it
    from collections import namedtuple

    class Leaderboard(list):
        def __init__(self):
            self._items = []
            super().__init__()

        def __getitem__(self, item):
            # ranks are 1-based, so rank 1 maps to index 0
            return self._items[item - 1]

        def __len__(self):
            return len(self._items)

        def add(self, value):
            self._items.append(value)

    # Also use namedtuple to create two tuple types whose fields are accessible by name
    user = namedtuple('User', 'name rank clan honor profile img')
    leader_board = namedtuple('Leaderboard', 'position')
    # Attach a Leaderboard list as position (set on the namedtuple class itself; no instance is created)
    leader_board.position = Leaderboard()
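
One detail of the `item - 1` trick is worth spelling out: with the class above, `position[0]` evaluates `self._items[-1]` and so quietly returns the last user. The sketch below is a hypothetical variant, not the class this post submits; `RankedList` is a name made up here, and it rejects ranks below 1 instead of wrapping around.

```python
class RankedList:
    """Hypothetical 1-based container: rank 1 is the first item added."""

    def __init__(self):
        self._items = []

    def add(self, value):
        self._items.append(value)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, rank):
        # Reject rank 0 and negatives instead of wrapping around to the end.
        if not isinstance(rank, int) or rank < 1:
            raise IndexError("rank must be an integer >= 1")
        return self._items[rank - 1]


ranks = RankedList()
for name in ("first", "second", "third"):
    ranks.add(name)

print(ranks[1], len(ranks))
```

Since it no longer inherits from `list`, there is also no second, always-empty list hiding behind `_items`.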

Then, inside the loop, we store the corresponding data

# board is the list of tr rows with the header row already removed
for tr in board:
    cells = tr.find_all('td')
    profileLink = cells[1].contents[1].get('href')
    rank = cells[1].contents[0].find("span").contents[0]
    profileImg = cells[1].contents[1].img['src']
    name = cells[1].contents[1].contents[1]
    clan = ""
    if len(cells[2].contents) > 0:
        clan = cells[2].contents[0]
    honor = cells[3].contents[0]
    # honor must be an integer according to the problem statement
    leader_board.position.add(user(name, rank, clan, int(honor), profileLink, profileImg))

Finally, return this object and we are done.

Code

from bs4 import BeautifulSoup
from urllib import request
from collections import namedtuple

# import requests

URL = 'https://www.codewars.com/users/leaderboard'


class Leaderboard(list):
    def __init__(self):
        self._items = []
        super().__init__()

    def __getitem__(self, item):
        # ranks are 1-based, so rank 1 maps to index 0
        return self._items[item - 1]

    def __len__(self):
        return len(self._items)

    def add(self, value):
        self._items.append(value)


def solution():
    req_header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Connection": "keep-alive"
    }
    target_req = request.Request(url=URL, headers=req_header)
    target_response = request.urlopen(target_req)
    target_html = target_response.read().decode('utf-8')
    texts_soup = BeautifulSoup(target_html, 'lxml')
    # every tr on the page, minus the header row
    trs = texts_soup.find_all('tr')[1:]

    user = namedtuple('User', 'name rank clan honor profile img')
    leader_board = namedtuple('Leaderboard', 'position')

    # position is set on the namedtuple class itself; no instance is created
    leader_board.position = Leaderboard()
    for tr in trs:
        cells = tr.find_all('td')
        profileLink = cells[1].contents[1].get('href')
        rank = cells[1].contents[0].find("span").contents[0]
        profileImg = cells[1].contents[1].img['src']
        name = cells[1].contents[1].contents[1]
        clan = ""
        if len(cells[2].contents) > 0:
            clan = cells[2].contents[0]
        honor = cells[3].contents[0]
        leader_board.position.add(user(name, rank, clan, int(honor), profileLink, profileImg))
    return leader_board
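
The request setup inside `solution()` can also be pulled out into small helpers, which makes it possible to check the headers without touching the network. This is a rough sketch rather than part of the submitted solution: `build_request`, `fetch_html` and `DEFAULT_HEADERS` are names invented here, and the 10-second timeout is an arbitrary assumption. The header values are the ones used in the code above.

```python
from urllib import request

URL = "https://www.codewars.com/users/leaderboard"

# Header values copied from the solution above.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
}


def build_request(url=URL, headers=None):
    """Create the Request object without sending anything."""
    return request.Request(url=url, headers=headers or DEFAULT_HEADERS)


def fetch_html(url=URL, timeout=10):
    """Download the page and decode it as UTF-8 (performs a network call)."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8")


req = build_request()  # no network traffic happens here
print(req.full_url)
```

Only `fetch_html` goes online, so the parsing code can be exercised against saved HTML while the page is unreachable.
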
Author: Black_Jack
Link: https://foryl.github.io/blog/2018/04/25/python解析表格/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.