<h2>What's this?</h2>
<p>If you are developing applications with modern JavaScript frameworks like <em>Angular, Ember, or Durandal</em>, you probably already know that these applications are not crawlable by search engine robots without a couple of extra steps.</p>
<h2>SEO according to Google</h2>
<p>If you want your JavaScript application to be crawlable, you need to implement some steps on your own. You can find information about the process in <a href="https://developers.google.com/webmasters/ajax-crawling/docs/getting-started?hl=iw">this Google document</a>. Take a look at it to better understand what is required on both the client and the server side.</p>
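<p>In short, the scheme works like this: when your application uses hashbang URLs such as http://example.com/#!/products, a crawler that supports the scheme requests http://example.com/?_escaped_fragment_=/products instead, and your server is expected to answer that request with an HTML Snapshot of the page. The following is only a minimal sketch of the server-side detection; <em>getSnapshotFor</em> is a hypothetical helper that would call a service such as AzureCrawler:</p>
```
using System;
using System.Web;

public static class SnapshotHandler
{
    // Serves a pre-rendered HTML Snapshot when the request carries the
    // _escaped_fragment_ parameter defined by Google's AJAX crawling scheme.
    // Returns false for regular visitors, who get the JavaScript application.
    public static bool TryServeSnapshot(HttpContext context, Func<string, string> getSnapshotFor)
    {
        string fragment = context.Request.QueryString["_escaped_fragment_"];
        if (fragment == null)
        {
            return false;
        }

        // "/?_escaped_fragment_=/products" maps back to "/#!/products".
        string url = context.Request.Url.GetLeftPart(UriPartial.Path) + "#!" + fragment;
        context.Response.Write(getSnapshotFor(url)); // hypothetical helper
        return true;
    }
}
```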
<h2>What is AzureCrawler about?</h2>
<p>AzureCrawler helps with taking <em>HTML Snapshots</em> of your dynamically generated content.</p>
<p>This project is specific to Azure and is ready to be deployed as a Cloud Service. <em>AzureCrawler is a Worker Role that uses OWIN to Self-Host a Web API</em>. </p>
<p>That said, it's easy to bring the code into your own solution if you don't want to use it as a separate Cloud Service. Likewise, if you are not using .NET and Azure, it isn't complicated to port it to another platform like Amazon Web Services.</p>
<h2>How does AzureCrawler work?</h2>
<p>The self-hosted Web API contained in AzureCrawler exposes a single endpoint:</p>
```
POST api/snapshot
```
<p>If you make an API call to that endpoint, a <em>PhantomJS</em> process will run and take an <em>HTML Snapshot</em> of the provided URL.</p>
<p>You can pass the following parameters in the body of the POST call:</p>
```
string ApiId (required). The application identification
string Application (required). The application name
string Url (required). The url to crawl
bool Store (optional). If you want to store the snapshot for future calls
DateTime ExpirationDate (optional). The expiration of the stored snapshot
string UserAgent (optional). The user agent of the bot crawling your application
```
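<p>A call to the endpoint could look like the sketch below. This is not code from the project: the host name and the JSON field values are placeholders you would replace with your own deployment's values.</p>
```
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class SnapshotClient
{
    static async Task Main()
    {
        // Placeholder values; use your own deployment's host and credentials.
        const string json = @"{
            ""ApiId"": ""Any ApiId"",
            ""Application"": ""Any Application name"",
            ""Url"": ""http://example.com/#!/products"",
            ""Store"": true,
            ""ExpirationDate"": ""2014-12-31T00:00:00Z"",
            ""UserAgent"": ""Googlebot""
        }";

        using (var client = new HttpClient())
        {
            var content = new StringContent(json, Encoding.UTF8, "application/json");
            var response = await client.PostAsync(
                "http://your-cloud-service.cloudapp.net/api/snapshot", content);

            // The response body is expected to contain the HTML Snapshot.
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}
```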
<p>The <em>ApiId</em> and <em>Application</em> fields are required and are validated together.</p>
<p>There isn't any special validation mechanism beyond the following private method:</p>
```
/// <summary>
/// Validates the ApiKey. In the real world you should do this against a custom store.
/// </summary>
/// <param name="apiKey">The api key</param>
/// <param name="application">The application name</param>
/// <returns>bool</returns>
private bool ValidateCredentials(string apiKey, string application)
{
    if (apiKey == "Any ApiId" && application == "Any Application name")
    {
        return true;
    }
    return false;
}
```
<p>You can plug in your own mechanism, use your own keys, or use a database to store application credentials.</p>
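<p>For instance, a database-backed replacement for the method above could look like this sketch, assuming a hypothetical <em>Applications</em> table with <em>ApiKey</em> and <em>Name</em> columns and a connection string named <em>Credentials</em>:</p>
```
using System.Configuration;
using System.Data.SqlClient;

// Sketch of a database-backed ValidateCredentials; the connection string
// name and the Applications table are assumptions, not part of the project.
private bool ValidateCredentials(string apiKey, string application)
{
    string connectionString =
        ConfigurationManager.ConnectionStrings["Credentials"].ConnectionString;

    const string query =
        "SELECT COUNT(*) FROM Applications WHERE ApiKey = @apiKey AND Name = @application";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(query, connection))
    {
        command.Parameters.AddWithValue("@apiKey", apiKey);
        command.Parameters.AddWithValue("@application", application);
        connection.Open();
        return (int)command.ExecuteScalar() > 0;
    }
}
```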
<p>The <em>Url</em> is the resource you want to crawl. The PhantomJS process will take the snapshot and will wait until all the dynamically generated content has loaded.</p>
<p>The remaining fields provide information for storing the HTML Snapshot in the store you prefer.</p>
<p>By default, AzureCrawler will store the snapshots in Azure Storage, in a blob container named after the <em>Application</em> field.</p>
<p>If you do this, the next time a bot requests the same <em>Url</em>, the snapshot will be served from storage.</p>
<p>When a stored snapshot expires, a new crawl is performed and a fresh snapshot is stored.</p>
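<p>The store-then-serve flow can be pictured with a small sketch against the Azure Storage client library. This is illustrative, not the project's actual code; in particular, deriving the blob name from the URL and keeping the expiration date in blob metadata are assumptions:</p>
```
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public class SnapshotStore
{
    private readonly CloudBlobClient client;

    public SnapshotStore(string connectionString)
    {
        client = CloudStorageAccount.Parse(connectionString).CreateCloudBlobClient();
    }

    // Stores a snapshot in a blob container named after the application.
    public void Save(string application, string url, string html, DateTime expirationDate)
    {
        CloudBlobContainer container = client.GetContainerReference(application.ToLowerInvariant());
        container.CreateIfNotExists();

        CloudBlockBlob blob = container.GetBlockBlobReference(Uri.EscapeDataString(url));
        blob.UploadText(html);

        // Keep the expiration date as blob metadata (illustrative key name).
        blob.Metadata["Expires"] = expirationDate.ToUniversalTime().ToString("o");
        blob.SetMetadata();
    }

    // Returns the cached snapshot, or null when it is missing or expired.
    public string TryLoad(string application, string url)
    {
        CloudBlockBlob blob = client
            .GetContainerReference(application.ToLowerInvariant())
            .GetBlockBlobReference(Uri.EscapeDataString(url));

        try
        {
            blob.FetchAttributes(); // throws if the blob does not exist
        }
        catch (StorageException)
        {
            return null;
        }

        string expires;
        if (blob.Metadata.TryGetValue("Expires", out expires) &&
            DateTimeOffset.Parse(expires) < DateTimeOffset.UtcNow)
        {
            return null; // expired: crawl again and store a fresh snapshot
        }

        return blob.DownloadText();
    }
}
```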
<h2>Known issues</h2>
<p>There is an <a href="http://stackoverflow.com/questions/20258352/net-azure-sdk-blob-request-returns-400-badrequest">incompatibility</a> between the Azure Compute Emulator included in SDK 2.2 and the latest 3.x Storage assemblies, so you should test against live containers until the next Azure toolkit is released.</p>