[Plugin] PawnScraper

Started by Kalcor, SA:MP, Aug 08, 2023, 07:46 AM


PawnScraper







A powerful scraper plugin that provides an interface for using HTML parsers and CSS selectors in Pawn.

Installing



Thanks to Southclaws, plugin installation is now much easier with sampctl:




Code:




sampctl p install Sreyas-Sreelal/pawn-scraper 






OR

  • Download the binary for your operating system from the releases page

  • Add it to your plugins folder

  • Add PawnScraper (Windows) or PawnScraper.so (Linux) to the plugins line in server.cfg

  • Add pawnscraper.inc to your includes folder


Building

  • Clone the repo




    Code:




    git clone https://github.com/Sreyas-Sreelal/pawn-scraper.git 







  • Compile the plugin using the nightly Rust toolchain

    • Windows


      Code:




      cargo +nightly-i686-pc-windows-msvc build --release 







    • Linux


      Code:




      cargo +nightly-i686-unknown-linux-gnu build --release 









API

  • ParseHtmlDocument(document[])

    • Params

      • document[] - string of html document


    • Returns

      • Html document instance id

      • INVALID_HTML_DOC if the document could not be parsed


    • Example Usage




      Code:

      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      ");

      ASSERT(doc != INVALID_HTML_DOC);

      DeleteHtml(doc);








  • ResponseParseHtml(Response:id)

    • Params

      • id - HTTP response id returned by HttpGet


    • Returns

      • Html document instance id

      • INVALID_HTML_DOC if the document could not be parsed


    • Example Usage




      Code:

      new Response:response = HttpGet("https://www.sa-mp.com");

      new Html:doc = ResponseParseHtml(response);

      ASSERT(doc != INVALID_HTML_DOC);

      // release both objects once they are no longer needed
      DeleteHtml(doc);
      DeleteResponse(response);








  • HttpGet(url[],Header:headerid=INVALID_HEADER)

    • Params

      • url[] - Url of a website

      • headerid - id of a header object created using CreateHeader (optional)


    • Returns

      • Response id if successful

      • INVALID_HTTP_RESPONSE if the request failed


    • Example Usage




      Code:

      new Response:response = HttpGet("https://www.sa-mp.com");

      ASSERT(response != INVALID_HTTP_RESPONSE);

      DeleteResponse(response);








  • HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)

    • Params

      • playerid - id of the player

      • callback[] - name of the callback function to handle the response.

      • url[] - Url of a website

      • headerid - id of a header object created using CreateHeader (optional)


    • Example Usage


      Code:

      HttpGetThreaded(0,"MyHandler","https://sa-mp.com");

      //********

      forward MyHandler(playerid,Response:responseid);
      public MyHandler(playerid,Response:responseid){
          ASSERT(responseid != INVALID_HTTP_RESPONSE);
          DeleteResponse(responseid);
          return 1;
      }










  • ParseSelector(string[])

    • Params

      • string[] - CSS selector


    • Returns

      • Selector instance id if successful

      • INVALID_SELECTOR if the selector could not be parsed


    • Example Usage




      Code:

      new Selector:selector = ParseSelector("h1 .foo");

      ASSERT(selector != INVALID_SELECTOR);

      DeleteSelector(selector);








  • CreateHeader(...)

    • Params

      • key, value pairs, each passed as a string


    • Returns

      • Header instance id if successful

      • INVALID_HEADER if the header could not be created


    • Example Usage




      Code:

      new Header:header = CreateHeader(
          "User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
      );

      ASSERT(header != INVALID_HEADER);

      new Response:response = HttpGet("https://sa-mp.com/",header);

      ASSERT(response != INVALID_HTTP_RESPONSE);

      DeleteResponse(response);

      ASSERT(DeleteHeader(header) == 1);








  • GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))

    • Params

      • docid - Html instance id

      • selectorid - CSS selector instance id

      • idx - zero-based index of the matching element in the document

      • string[] - buffer that receives the element name

      • size - size of the buffer


    • Returns

      • 1 if successful

      • 0 if failed


    • Example Usage




      Code:

      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      ");

      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("i");

      ASSERT(selector != INVALID_SELECTOR);

      // walk every match until the native reports failure
      new i = -1,element_name[10];
      while(GetNthElementName(doc,selector,++i,element_name) != 0){
          ASSERT(strcmp(element_name,"i") == 0);
      }

      DeleteSelector(selector);
      DeleteHtml(doc);








  • GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))

    • Params

      • docid - Html instance id

      • selectorid - CSS selector instance id

      • idx - zero-based index of the matching element in the document

      • string[] - buffer that receives the element text

      • size - size of the buffer


    • Returns

      • 1 if successful

      • 0 if failed


    • Example Usage




      Code:

      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      ");

      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("h1.foo");

      ASSERT(selector != INVALID_SELECTOR);

      new element_text[20];
      ASSERT(GetNthElementText(doc,selector,0,element_text) == 1);

      new check = strcmp(element_text,"Hello, world!");
      ASSERT(check == 0);

      DeleteSelector(selector);
      DeleteHtml(doc);








  • GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))

    • Params

      • docid - Html instance id

      • selectorid - CSS selector instance id

      • idx - zero-based index of the matching element in the document

      • attribute[] - name of the attribute to read

      • string[] - buffer that receives the attribute value

      • size - size of the buffer


    • Returns

      • 1 if successful

      • 0 if failed


    • Example Usage




      Code:

      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      ");

      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("h1");

      ASSERT(selector != INVALID_SELECTOR);

      new element_attribute[20];
      ASSERT(GetNthElementAttrVal(doc,selector,0,"class",element_attribute) == 1);

      new check = strcmp(element_attribute,"foo");
      ASSERT(check == 0);

      DeleteSelector(selector);
      DeleteHtml(doc);









  • DeleteHtml(Html:id)

    • Params

      • id - html instance to be deleted


    • Returns

      • 1 if successful

      • 0 if failed




  • DeleteSelector(Selector:id)

    • Params

      • id - selector instance to be deleted


    • Returns

      • 1 if successful

      • 0 if failed




  • DeleteResponse(Response:id)

    • Params

      • id - response instance to be deleted


    • Returns

      • 1 if successful

      • 0 if failed




  • DeleteHeader(Header:id)

    • Params

      • id - header instance to be deleted


    • Returns

      • 1 if successful

      • 0 if failed
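
Since every Parse*/Http*/Create* call above has a matching Delete* call, a small wrapper can keep the cleanup in one place. The following is only a sketch: FetchFirstText is a hypothetical helper name, not something the plugin provides, and it assumes the objects can be deleted in any order (the examples in this post delete them in differing orders).

```pawn
#include <a_samp>
#include <pawnscraper>

// Hypothetical helper (not part of the plugin): copies the text of the
// first element matching `css` on the page at `url` into `dest`.
// Returns 1 on success and 0 on any failure.
stock FetchFirstText(url[], css[], dest[], size = sizeof(dest))
{
    new Response:response = HttpGet(url);
    if(response == INVALID_HTTP_RESPONSE)
        return 0; // nothing was created, so nothing to delete

    new ok = 0;
    new Html:html = ResponseParseHtml(response);
    if(html != INVALID_HTML_DOC)
    {
        new Selector:selector = ParseSelector(css);
        if(selector != INVALID_SELECTOR)
        {
            ok = GetNthElementText(html, selector, 0, dest, size);
            DeleteSelector(selector);
        }
        DeleteHtml(html);
    }

    // the response is released on every path that created it
    DeleteResponse(response);
    return ok;
}
```

Each object is deleted in the same block that created it, so a failure at any step cannot leak the objects created before it.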






Example Usage



A small example that fetches all links on wiki.sa-mp.com:




Code:

new Response:response = HttpGet("https://wiki.sa-mp.com");
if(response == INVALID_HTTP_RESPONSE){
    printf("HTTP ERROR");
    return;
}

new Html:html = ResponseParseHtml(response);
if(html == INVALID_HTML_DOC){
    DeleteResponse(response);
    return;
}

new Selector:selector = ParseSelector("a");
if(selector == INVALID_SELECTOR){
    DeleteResponse(response);
    DeleteHtml(html);
    return;
}

new str[500],i;
while(GetNthElementAttrVal(html,selector,i,"href",str)){
    printf("%s",str);
    ++i;
}

//delete created objects after the usage..
DeleteHtml(html);
DeleteResponse(response);
DeleteSelector(selector);








The same example using a threaded HTTP call would be:




Code:

HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");

//...

forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    if(responseid == INVALID_HTTP_RESPONSE){
        printf("HTTP ERROR");
        return 0;
    }

    new Html:html = ResponseParseHtml(responseid);
    if(html == INVALID_HTML_DOC){
        DeleteResponse(responseid);
        return 0;
    }

    new Selector:selector = ParseSelector("a");
    if(selector == INVALID_SELECTOR){
        DeleteResponse(responseid);
        DeleteHtml(html);
        return 0;
    }

    new str[500],i;
    while(GetNthElementAttrVal(html,selector,i,"href",str)){
        printf("%s",str);
        ++i;
    }

    //delete created objects after the usage..
    DeleteHtml(html);
    DeleteResponse(responseid);
    DeleteSelector(selector);
    return 1;
}












More examples can be found in examples



Repository

https://github.com/Sreyas-Sreelal/pawn-scraper



Note



The plugin is at an early stage, and more tests and features still need to be added. I'm open to any kind of contribution; just open a pull request if you have improvements or new features to add.



Special thanks

