XPath Navigation
WEB SCRAPING IN PYTHON
Thomas Laetsch, Data Scientist, NYU
Slashes and Brackets
Single forward slash / looks forward one generation
Double forward slash // looks forward all future generations
Square brackets [] help narrow in on specific elements
To Bracket or not to Bracket
xpath = '/html/body'
xpath = '/html[1]/body[1]'
Give the same selection
A Body of P
xpath = '/html/body/p'
The Birds and the Ps
xpath = '/html/body/div/p'
xpath = '/html/body/div/p[2]'
Double Slashing the Brackets
xpath = '//p'
xpath = '//p[1]'
The Wildcard
xpath = '/html/body/*'
The asterisk * is the "wildcard"
Xposé
Off the Beaten XPath
Thomas Laetsch, Data Scientist, NYU
(At)tribute
@ represents "attribute"
@class
@id
@href
Brackets and Attributes
xpath = '//p[@class="class-1"]'
Brackets and Attributes
xpath = '//*[@id="uid"]'
Brackets and Attributes
xpath = '//div[@id="uid"]/p[2]'
Content with Contains
XPath contains notation:
contains( @attri-name, "string-expr" )
Contain This
xpath = '//*[contains(@class,"class-1")]'
Contain This
xpath = '//*[@class="class-1"]'
Get Classy
xpath = '/html/body/div/p[2]'
Get Classy
xpath = '/html/body/div/p[2]/@class'
End of the Path
Introduction to the scrapy Selector
Thomas Laetsch, Data Scientist, NYU
Setting up a Selector
    from scrapy import Selector

    html = '''
    <html>
      <body>
        <div class="hello datacamp">
          <p>Hello World!</p>
        </div>
        <p>Enjoy DataCamp!</p>
      </body>
    </html>
    '''

    sel = Selector(text=html)
Created a scrapy Selector object using a string with the html code
The selector sel has selected the entire html document
Selecting Selectors
We can use the xpath call within a Selector to create new Selectors of specific pieces of the html code
The return is a SelectorList of Selector objects
    sel.xpath("//p")
    # outputs the SelectorList:
    [<Selector xpath='//p' data='<p>Hello World!</p>'>,
     <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
Extracting Data from a SelectorList
Use the extract() method
    >>> sel.xpath("//p")
    out: [<Selector xpath='//p' data='<p>Hello World!</p>'>,
          <Selector xpath='//p' data='<p>Enjoy DataCamp!</p>'>]
    >>> sel.xpath("//p").extract()
    out: [ '<p>Hello World!</p>',
           '<p>Enjoy DataCamp!</p>' ]
We can use extract_first() to get the first element of the list
    >>> sel.xpath("//p").extract_first()
    out: '<p>Hello World!</p>'
Extracting Data from a Selector
    ps = sel.xpath('//p')
    second_p = ps[1]
    second_p.extract()
    out: '<p>Enjoy DataCamp!</p>'
Select This Course!
"Inspecting the HTML"
Thomas Laetsch, PhD, Data Scientist, NYU
"Source" = HTML Code
Inspecting Elements
HTML text to Selector
    from scrapy import Selector
    import requests

    url = 'https://www.datacamp.com/courses/all'
    html = requests.get( url ).content
    sel = Selector( text = html )
You Know Our Secrets