Hi there,
I am using the following code to build a pandas DataFrame:
<pre>import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np
import datetime as dt
class work:
    def __init__(self,link):
        self.link=link
        self.res=requests.get(self.link)
        self.soup=BeautifulSoup(self.res.content, "lxml")
        self.table = self.soup.find_all('table')[0]
        self.l = pd.read_html(str(self.table))
    def create(self):
        self.ll=[]
        for i in range(0,6):
            l1=self.l[1][0:1][i]
            l1=list(l1)
            self.ll.extend(l1)
        l2=self.l[1][2:]
        self.date=list(l2[0])
        self.location=list(l2[1])
        self.lancaster=list(l2[2])
        self.spitfire=list(l2[3])
        self.hurricane=list(l2[4])
        self.dakota=list(l2[5])
    def month(self):
        mm=self.l[1][1][1]
        if mm=='May':
            x=5
        elif mm=='June':
            x=6
        elif mm=='July':
            x=7
        elif mm=='August':
            x=8
        elif mm=='September':
            x=9
        else:
            x=0
        return x
    def refine(self):
        self.create()
        arr=np.asarray(self.date)
        temp=arr[0]
        for i in range(0,len(arr)):
            if arr[i]=='nan':
                arr[i]=temp
            else:
                temp=arr[i]
        self.y=list(arr)
        return self.y
    def convert(self):
        lx=[]
        x=self.refine()
        y=self.month()
        for i in range(0,len(x)):
            lx.append((dt.datetime(2006, y, int(x[i]))).strftime('%d-%b-%Y'))
        return lx
    def post(self):
        date=self.convert()
        dff = pd.DataFrame(list(zip(date,self.location,self.lancaster,self.spitfire,self.hurricane,self.dakota)),
                           columns =self.ll)
        return dff
#a=work('http://web.archive.org/web/20050726230748/http://www.raf.mod.uk/bbmf/may05.html')
#b=work('http://web.archive.org/web/20050726230748/http://www.raf.mod.uk/bbmf/june05.html')
#c=work('http://web.archive.org/web/20050726230748/http://www.raf.mod.uk/bbmf/july05.html')
#d=work('http://web.archive.org/web/20050726230748/http://www.raf.mod.uk/bbmf/august05.html')
#e=work('http://web.archive.org/web/20050726230748/http://www.raf.mod.uk/bbmf/september05.html')
a=work('http://web.archive.org/web/20060811232523/http://www.deltaweb.co.uk/bbmf/may06.html')
b=work('http://web.archive.org/web/20060811232523/http://www.deltaweb.co.uk/bbmf/june06.html')
c=work('http://web.archive.org/web/20060811232523/http://www.deltaweb.co.uk/bbmf/july06.html')
d=work('http://web.archive.org/web/20060811232523/http://www.deltaweb.co.uk/bbmf/august06.html')
e=work('http://web.archive.org/web/20060811232523/http://www.deltaweb.co.uk/bbmf/september06.html')
#a=work('http://web.archive.org/web/20070701133815/http://www.bbmf.co.uk/may07.html')
#b=work('http://web.archive.org/web/20070701133815/http://www.bbmf.co.uk/june07.html')
#c=work('http://web.archive.org/web/20070701133815/http://www.bbmf.co.uk/july07.html')
#d=work('http://web.archive.org/web/20070701133815/http://www.bbmf.co.uk/august07.html')
#e=work('http://web.archive.org/web/20070701133815/http://www.bbmf.co.uk/september07.html')
#a=work('http://web.archive.org/web/20081116021904/http://www.bbmf.co.uk/may08.html')
#b=work('http://web.archive.org/web/20081116021904/http://www.bbmf.co.uk/june08.html')
#c=work('http://web.archive.org/web/20081116021904/http://www.bbmf.co.uk/july08.html')
#d=work('http://web.archive.org/web/20081116021904/http://www.bbmf.co.uk/august08.html')
#e=work('http://web.archive.org/web/20081116021904/http://www.bbmf.co.uk/september08.html')
dff1=a.post()
dff2=b.post()
dff3=c.post()
dff4=d.post()
dff5=e.post()
F = pd.concat([dff1, dff2, dff3, dff4, dff5], axis=0)
display = F[(F['Location'].str.contains('- Display')) & (F['Dakota'].str.contains('D')) & (F['Spitfire'].str.contains('S', na=True)) & (F['Lancaster'] != 'L')]
#Months = May Jun Jul Aug Sep
#Months = -05- -06- -07- -08- -09- #('[a-zA-Z]')) or #('- Display')) or #('- Display|Win'))
#display = F[(F['Location'].str.contains('[a-zA-Z]')) & (F['Date'].str.contains('Jul')) & (F['Dakota'].str.contains('D')) & (F['Spitfire'].str.contains('S', na=True)) & (F['Lancaster'] != 'L')]
pd.options.display.max_rows = 1000
pd.options.display.max_columns = 1000
display.drop('Lancaster', axis=1, inplace=True)
display=display.dropna(subset=['Spitfire', 'Hurricane'], how='all')
#display=display[['Date','Location','Dakota','Hurricane','Spitfire']]
display=display[['Location','Date','Dakota','Hurricane','Spitfire']]
display=display.fillna('--')
display.loc[86,'Location']='Windermere - Display' #'Windermere Air Show'
display.reset_index(drop=True, inplace=True)
display.to_csv(r'C:\Users\Edward\Desktop\BBMF Schedules And Master Forum Thread Texts\BBMF-2006-Code (Dakota With Fighters).csv')
display
I now want only the Displays in the output DataFrame, so to filter the rows I use the following line of code:
display = F[(F['Location'].str.contains('- Display'))
I also changed one row, whose Location read Windermere Air Show, to Windermere - Display, using the following line of code:
display.loc[86,'Location']='Windermere - Display'
However, when I run my code, the output correctly shows only the - Display rows, except that the Windermere - Display row shows as:
Windermere - Display NaN NaN NaN NaN
Do I need to add inplace=True to the display.loc line for the data in that row to show? If so, how should the line read with that incorporated? If not, what change do I need to make?
I tried moving the .loc line to other positions in the full code, but that made no difference: the other columns of that row are still NaN in my output. The index label 86 is correct, so a wrong number is not the issue.
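For what it is worth, the symptom described here matches pandas' "setting with enlargement" behaviour: assigning through .loc to a row label that is not present in the frame appends a new row instead of editing an existing one. A .loc assignment also takes no inplace argument, because it already modifies the frame it is called on. The sketch below reproduces this on a made-up three-row frame; the name `toy` and its values are illustrative assumptions, not the real schedule data, and whether this is exactly what happens in the full script is an inference from the output shown.

```python
import pandas as pd

# Toy stand-in for F; labels and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "Location": ["Elvington - Display", "Windermere Air Show", "Twinwoods - Display"],
        "Date": ["19-Aug-2006", "30-Jul-2006", "27-Aug-2006"],
    },
    index=[85, 86, 87],
)

# Filter first: the row labelled 86 does not contain '- Display', so it is dropped.
filtered = toy[toy["Location"].str.contains("- Display")].copy()

# Label 86 is no longer in `filtered`, so this does not edit a row.
# Instead .loc enlarges the frame: a new row is appended at the bottom
# and every column that was not assigned is filled with NaN.
filtered.loc[86, "Location"] = "Windermere - Display"

print(filtered)
```

In the full code the .loc line also runs after fillna('--'), which would explain why the appended row still shows NaN rather than '--'.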
If I use the following line of code instead:
F[(F['Location'].str.contains('- Display|Win'))
I get the correct DataFrame output, with the Windermere - Display row showing properly in the correct position. However, I would like to get that output without including |Win in the filter, if possible. If someone could point me to the change(s) I need to make to achieve that, I would be very grateful.
I would also like to know why the line with |Win in it shows the Windermere - Display row in the proper position in the output DataFrame, whereas with only - Display in the filter, the Windermere - Display row appears at the bottom of the output with NaN in every other column, even if I move the .loc line a few lines up in the full code.
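One possible way to drop |Win, sketched under the assumption that the original row's Location text is exactly 'Windermere Air Show' (as the comment in the full code suggests): rename the row in the unfiltered frame first, so that the plain '- Display' filter keeps it in its original position. The '- Display|Win' pattern presumably works for a related reason: as a regular expression it also matches 'Win', so the row labelled 86 survives the filter and .loc then edits an existing row. The frame below is a toy stand-in with invented values.

```python
import pandas as pd

# Toy stand-in for F; labels and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "Location": ["Elvington - Display", "Windermere Air Show", "Twinwoods - Display"],
        "Date": ["19-Aug-2006", "30-Jul-2006", "27-Aug-2006"],
    },
    index=[85, 86, 87],
)

# Rename BEFORE filtering, while the row still exists in the frame.
# Matching on the Location text (assumed to be exactly this string)
# avoids relying on the numeric label 86.
toy.loc[toy["Location"] == "Windermere Air Show", "Location"] = "Windermere - Display"

# The plain '- Display' filter now keeps the renamed row in place.
display = toy[toy["Location"].str.contains("- Display")].copy()
display = display.reset_index(drop=True)

print(display)
```

Selecting by the Location text rather than by label also sidesteps repeated index labels, which pd.concat keeps unless ignore_index=True is passed.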
The following is the DataFrame output I get when I use the |Win line of code, which is the correct output:
    Location                          Date         Dakota  Hurricane  Spitfire
0   Woodspring Wings - Display        01-Jul-2006  D       H          S
1   Duxford Flying Legends - Display  08-Jul-2006  D       H          S
2   RAF Odiham - Display              27-Jul-2006  D       --         S
3   East Fortune - Display            29-Jul-2006  D       H          S
4   Windermere - Display              30-Jul-2006  D       H          S
5   Whitby Carnival - Display         12-Aug-2006  D       --         S
6   Weymouth Carnival - Display       16-Aug-2006  D       H          S
7   Dawlish Carnival - Display        17-Aug-2006  D       H          S
8   Elvington - Display               19-Aug-2006  D       H          S
9   Elvington - Display               20-Aug-2006  D       H          S
10  Twinwoods - Display               27-Aug-2006  D       --         S
11  Bodelwyddan Castle - Display      28-Aug-2006  D       --         S
And the following is the output I get when I use the - Display line of code:
    Location                          Date         Dakota  Hurricane  Spitfire
0   Woodspring Wings - Display        01-Jul-2006  D       H          S
1   Duxford Flying Legends - Display  08-Jul-2006  D       H          S
2   RAF Odiham - Display              27-Jul-2006  D       --         S
3   East Fortune - Display            29-Jul-2006  D       H          S
4   Whitby Carnival - Display         12-Aug-2006  D       --         S
5   Weymouth Carnival - Display       16-Aug-2006  D       H          S
6   Dawlish Carnival - Display        17-Aug-2006  D       H          S
7   Elvington - Display               19-Aug-2006  D       H          S
8   Elvington - Display               20-Aug-2006  D       H          S
9   Twinwoods - Display               27-Aug-2006  D       --         S
10  Bodelwyddan Castle - Display      28-Aug-2006  D       --         S
11  Windermere - Display              NaN          NaN     NaN        NaN
Could a moderator edit my DataFrame outputs, if that is okay?
I tidied them up, but they are still not displaying correctly.
Any help would be much appreciated.
Regards
Eddie Winch
What I have tried:
As described in the Describe the Problem section.