diff --git a/README.md b/README.md index 66ec022..0213fd0 100644 --- a/README.md +++ b/README.md @@ -40,8 +40,9 @@ Contents 1. [Arch/AUR](#arch-linux--arch-based-distributions) 1. [Helm/Kubernetes](#helm-chart-for-kubernetes) 4. [Environment Variables and Configuration](#environment-variables) -5. [Usage](#usage) -6. [Extra Steps](#extra-steps) +5. [Google Custom Search (BYOK)](#google-custom-search-byok) +6. [Usage](#usage) +7. [Extra Steps](#extra-steps) 1. [Set Primary Search Engine](#set-whoogle-as-your-primary-search-engine) 2. [Custom Redirecting](#custom-redirecting) 2. [Custom Bangs](#custom-bangs) @@ -50,10 +51,10 @@ Contents 5. [Using with Firefox Containers](#using-with-firefox-containers) 6. [Reverse Proxying](#reverse-proxying) 1. [Nginx](#nginx) -7. [Contributing](#contributing) -8. [FAQ](#faq) -9. [Public Instances](#public-instances) -10. [Screenshots](#screenshots) +8. [Contributing](#contributing) +9. [FAQ](#faq) +10. [Public Instances](#public-instances) +11. [Screenshots](#screenshots) ## Features - No ads or sponsored content @@ -475,7 +476,6 @@ There are a few optional environment variables available for customizing a Whoog | WHOOGLE_AUTOCOMPLETE | Controls visibility of autocomplete/search suggestions. Default on -- use '0' to disable. | | WHOOGLE_MINIMAL | Remove everything except basic result cards from all search queries. | | WHOOGLE_CSP | Sets a default set of 'Content-Security-Policy' headers | -| WHOOGLE_RESULTS_PER_PAGE | Set the number of results per page | | WHOOGLE_TOR_SERVICE | Enable/disable the Tor service on startup. Default on -- use '0' to disable. | | WHOOGLE_TOR_USE_PASS | Use password authentication for tor control port. | | WHOOGLE_TOR_CONF | The absolute path to the config file containing the password for the tor control port. 
Default: ./misc/tor/control.conf WHOOGLE_TOR_PASS must be 1 for this to work.| @@ -512,6 +512,103 @@ These environment variables allow setting default config values, but can be over | WHOOGLE_CONFIG_ANON_VIEW | Include the "anonymous view" option for each search result | | WHOOGLE_CONFIG_SHOW_USER_AGENT | Display the User Agent string used for search in results footer | +### Google Custom Search (BYOK) Environment Variables + +These environment variables configure the "Bring Your Own Key" feature for Google Custom Search API: + +| Variable | Description | +| -------------------- | ----------------------------------------------------------------------------------------- | +| WHOOGLE_CSE_API_KEY | Your Google API key with Custom Search API enabled | +| WHOOGLE_CSE_ID | Your Custom Search Engine ID (cx parameter) | +| WHOOGLE_USE_CSE | Enable Custom Search API by default (set to '1' to enable) | + +## Google Custom Search (BYOK) + +If Google blocks traditional search scraping (captchas, IP bans), you can use your own Google Custom Search Engine credentials as a fallback. This uses Google's official API with your own quota. + +### Why Use This? + +- **Reliability**: Official API never gets blocked or rate-limited (within quota) +- **Speed**: Direct JSON responses are faster than HTML scraping +- **Fallback**: Works when all scraping workarounds fail +- **Privacy**: Your searches still don't go through third parties—they go directly to Google with your own API key + +### Limitations vs Standard Whoogle + +| Feature | Standard Scraping | CSE API | +|------------------|--------------------------|---------------------| +| Daily limit | None (until blocked) | 100 free, then paid | +| Image search | ✅ Full support | ✅ Supported | +| News/Videos tabs | ✅ | ❌ Web results only | +| Speed | Slower (HTML parsing) | Faster (JSON) | +| Reliability | Can be blocked | Always works | + +### Setup Steps + +#### 1. Create a Custom Search Engine +1. 
Go to [Programmable Search Engine](https://programmablesearchengine.google.com/controlpanel/all) +2. Click **"Add"** to create a new search engine +3. Under "What to search?", select **"Search the entire web"** +4. Give it a name (e.g., "My Whoogle CSE") +5. Click **"Create"** +6. Copy your **Search Engine ID** + +#### 2. Get an API Key +1. Go to [Google Cloud Console](https://console.cloud.google.com/) +2. Create a new project or select an existing one +3. Go to **APIs & Services** → **Library** +4. Search for **"Custom Search API"** and click **Enable** +5. Go to **APIs & Services** → **Credentials** +6. Click **"Create Credentials"** → **"API Key"** +7. Copy your API key (looks like `AIza...`) + +#### 3. (Recommended) Restrict Your API Key +To prevent misuse if your key is exposed: +1. Click on your API key in Credentials +2. Under **"API restrictions"**, select **"Restrict key"** +3. Choose only **"Custom Search API"** +4. Under **"Application restrictions"**, consider adding IP restrictions if using on a server +5. Click **Save** + +#### 4. Configure Whoogle + +**Option A: Via Settings UI** +1. Open your Whoogle instance +2. Click the **Config** button +3. Scroll to "Google Custom Search (BYOK)" section +4. Enter your API Key and CSE ID +5. Check "Use Custom Search API" +6. Click **Apply** + +**Option B: Via Environment Variables** +```bash +WHOOGLE_CSE_API_KEY=AIza... +WHOOGLE_CSE_ID=23f... +WHOOGLE_USE_CSE=1 +``` + +### Pricing & Avoiding Charges + +| Tier | Queries | Cost | +|------|------------------|-----------------------| +| Free | 100/day | $0 | +| Paid | Up to 10,000/day | $5 per 1,000 queries | + +**⚠️ To avoid unexpected charges:** + +1. **Don't add a payment method** to Google Cloud (safest option—API stops at 100/day) +2. **Set a billing budget alert**: [Billing → Budgets & Alerts](https://console.cloud.google.com/billing/budgets) +3. **Cap API usage**: APIs & Services → Custom Search API → Quotas → Set "Queries per day" to 100 +4. 
**Monitor usage**: APIs & Services → Custom Search API → Metrics + +### Troubleshooting + +| Error | Cause | Solution | +|---------------------|---------------------------|-----------------------------------------------------------------| +| "API key not valid" | Invalid or restricted key | Check key in Cloud Console, ensure Custom Search API is enabled | +| "Quota exceeded" | Hit 100/day limit | Wait until midnight PT, or enable billing | +| "Invalid CSE ID" | Wrong cx parameter | Copy ID from Programmable Search Engine control panel | + ## Usage Same as most search engines, with the exception of filtering by time range. diff --git a/app/filter.py b/app/filter.py index 70059d2..97dcc0f 100644 --- a/app/filter.py +++ b/app/filter.py @@ -222,7 +222,7 @@ class Filter: Returns: None (The soup object is modified directly) """ - if not div: + if not div or not isinstance(div, Tag): return for d in div.find_all('div', recursive=True): @@ -437,6 +437,11 @@ class Filter: if not self.main_divs: return + # Skip collapsing for CSE (Custom Search Engine) results + # CSE results have a data-cse attribute on the main container + if self.soup.find(attrs={'data-cse': 'true'}): + return + # Loop through results and check for the number of child divs in each for result in self.main_divs.find_all(): result_children = pull_child_divs(result) diff --git a/app/models/config.py b/app/models/config.py index fb2f31f..5370db3 100644 --- a/app/models/config.py +++ b/app/models/config.py @@ -48,6 +48,8 @@ class Config: self.show_user_agent = read_config_bool('WHOOGLE_CONFIG_SHOW_USER_AGENT') # Add user agent related keys to safe_keys + # Note: CSE credentials (cse_api_key, cse_id) are intentionally NOT included + # in safe_keys for security - they should not be shareable via URL self.safe_keys = [ 'lang_search', 'lang_interface', @@ -92,6 +94,11 @@ class Config: self.preferences_encrypted = read_config_bool('WHOOGLE_CONFIG_PREFERENCES_ENCRYPTED') self.preferences_key = 
os.getenv('WHOOGLE_CONFIG_PREFERENCES_KEY', '') + # Google Custom Search Engine (CSE) BYOK settings + self.cse_api_key = os.getenv('WHOOGLE_CSE_API_KEY', '') + self.cse_id = os.getenv('WHOOGLE_CSE_ID', '') + self.use_cse = read_config_bool('WHOOGLE_USE_CSE') + self.accept_language = False # Skip setting custom config if there isn't one diff --git a/app/request.py b/app/request.py index 6348fd5..dbeae8b 100644 --- a/app/request.py +++ b/app/request.py @@ -216,18 +216,11 @@ class Request: """ def __init__(self, normal_ua, root_path, config: Config, http_client=None): - results_per_page = str(os.getenv('WHOOGLE_RESULTS_PER_PAGE', 10)) - self.search_url = ( - 'https://www.google.com/search?gbv=1&num=' - f'{results_per_page}&q=' - ) + self.search_url = 'https://www.google.com/search?gbv=1&q=' # Google Images rejects the lightweight gbv=1 interface. Use the # modern udm=2 entrypoint specifically for image searches to avoid the # "update your browser" interstitial. - self.image_search_url = ( - 'https://www.google.com/search?udm=2&num=' - f'{results_per_page}&q=' - ) + self.image_search_url = 'https://www.google.com/search?udm=2&q=' # Optionally send heartbeat to Tor to determine availability # Only when Tor is enabled in config to avoid unnecessary socket usage if config.tor: diff --git a/app/routes.py b/app/routes.py index e111102..85bd2d9 100644 --- a/app/routes.py +++ b/app/routes.py @@ -17,6 +17,7 @@ from app import app from app.models.config import Config from app.models.endpoint import Endpoint from app.request import Request, TorError +from app.services.cse_client import CSEException from app.utils.bangs import suggest_bang, resolve_bang from app.utils.misc import empty_gif, placeholder_img, get_proxy_host_url, \ fetch_favicon @@ -356,6 +357,30 @@ def search(): session['config']['tor'] = False if e.disable else session['config'][ 'tor'] return redirect(url_for('.index')) + except CSEException as e: + localization_lang = g.user_config.get_localization_lang() + 
translation = app.config['TRANSLATIONS'][localization_lang] + wants_json = ( + request.args.get('format') == 'json' or + 'application/json' in request.headers.get('Accept', '') or + 'application/*+json' in request.headers.get('Accept', '') + ) + error_msg = f"Custom Search API Error: {e.message}" + if e.is_quota_error: + error_msg = ("Google Custom Search API quota exceeded. " + "Free tier allows 100 queries/day. " + "Wait until midnight PT or disable CSE in settings.") + if wants_json: + return jsonify({ + 'error': True, + 'error_message': error_msg, + 'query': urlparse.unquote(query) + }), e.code + return render_template( + 'error.html', + error_message=error_msg, + translation=translation, + config=g.user_config), e.code wants_json = ( request.args.get('format') == 'json' or @@ -424,6 +449,16 @@ def search(): search_util.search_type, g.user_config.preferences, translation) + + # Filter out unsupported tabs when CSE is enabled + # Keep only the all, images and maps tabs; CSE cannot serve videos/news + use_cse = ( + g.user_config.use_cse and + g.user_config.cse_api_key and + g.user_config.cse_id + ) + if use_cse: + tabs = {k: v for k, v in tabs.items() if k in ['all', 'images', 'maps']} # Feature to display currency_card # Since this is determined by more than just the diff --git a/app/services/cse_client.py b/app/services/cse_client.py new file mode 100644 index 0000000..8830be2 --- /dev/null +++ b/app/services/cse_client.py @@ -0,0 +1,452 @@ +"""Google Custom Search Engine (CSE) API Client + +This module provides a client for Google's Custom Search JSON API, +allowing users to bring their own API key (BYOK) for search functionality.
+""" + +import httpx +from typing import Optional +from dataclasses import dataclass +from urllib.parse import urlparse + +from flask import render_template + + +# Google Custom Search API endpoint +CSE_API_URL = 'https://www.googleapis.com/customsearch/v1' + + +class CSEException(Exception): + """Exception raised for CSE API errors""" + def __init__(self, message: str, code: int = 500, is_quota_error: bool = False): + self.message = message + self.code = code + self.is_quota_error = is_quota_error + super().__init__(self.message) + + +@dataclass +class CSEError: + """Represents an error from the CSE API""" + code: int + message: str + + @property + def is_quota_exceeded(self) -> bool: + return self.code == 429 or 'quota' in self.message.lower() + + @property + def is_invalid_key(self) -> bool: + return self.code == 400 or 'invalid' in self.message.lower() + + +@dataclass +class CSEResult: + """Represents a single search result from CSE API""" + title: str + link: str + snippet: str + display_link: str + html_title: Optional[str] = None + html_snippet: Optional[str] = None + # Image-specific fields (populated for image search) + image_url: Optional[str] = None + thumbnail_url: Optional[str] = None + image_width: Optional[int] = None + image_height: Optional[int] = None + context_link: Optional[str] = None # Page where image was found + + +@dataclass +class CSEResponse: + """Represents a complete CSE API response""" + results: list[CSEResult] + total_results: str + search_time: float + query: str + start_index: int + is_image_search: bool = False + error: Optional[CSEError] = None + + @property + def has_error(self) -> bool: + return self.error is not None + + @property + def has_results(self) -> bool: + return len(self.results) > 0 + + +class CSEClient: + """Client for Google Custom Search Engine API + + Usage: + client = CSEClient(api_key='your-key', cse_id='your-cse-id') + response = client.search('python programming') + + if response.has_error: + print(f"Error: 
{response.error.message}") + else: + for result in response.results: + print(f"{result.title}: {result.link}") + """ + + def __init__(self, api_key: str, cse_id: str, timeout: float = 10.0): + """Initialize CSE client + + Args: + api_key: Google API key with Custom Search API enabled + cse_id: Custom Search Engine ID (cx parameter) + timeout: Request timeout in seconds + """ + self.api_key = api_key + self.cse_id = cse_id + self.timeout = timeout + self._client = httpx.Client(timeout=timeout) + + def search( + self, + query: str, + start: int = 1, + num: int = 10, + safe: str = 'off', + language: str = '', + country: str = '', + search_type: str = '' + ) -> CSEResponse: + """Execute a search query against the CSE API + + Args: + query: Search query string + start: Starting result index (1-based, for pagination) + num: Number of results to return (max 10) + safe: Safe search setting ('off', 'medium', 'high') + language: Language restriction (e.g., 'lang_en') + country: Country restriction (e.g., 'countryUS') + search_type: Type of search ('image' for image search, '' for web) + + Returns: + CSEResponse with results or error information + """ + params = { + 'key': self.api_key, + 'cx': self.cse_id, + 'q': query, + 'start': start, + 'num': min(num, 10), # API max is 10 + 'safe': safe, + } + + # Add search type for image search + if search_type == 'image': + params['searchType'] = 'image' + + # Add optional parameters + if language: + # CSE uses 'lr' for language restrict + params['lr'] = language + if country: + # CSE uses 'cr' for country restrict + params['cr'] = country + + try: + response = self._client.get(CSE_API_URL, params=params) + data = response.json() + + # Check for API errors + if 'error' in data: + error_info = data['error'] + return CSEResponse( + results=[], + total_results='0', + search_time=0.0, + query=query, + start_index=start, + error=CSEError( + code=error_info.get('code', 500), + message=error_info.get('message', 'Unknown error') + ) + ) + + # 
Parse successful response + search_info = data.get('searchInformation', {}) + items = data.get('items', []) + is_image = search_type == 'image' + + results = [] + for item in items: + # Extract image-specific data if present + image_data = item.get('image', {}) + + results.append(CSEResult( + title=item.get('title', ''), + link=item.get('link', ''), + snippet=item.get('snippet', ''), + display_link=item.get('displayLink', ''), + html_title=item.get('htmlTitle'), + html_snippet=item.get('htmlSnippet'), + # Image fields + image_url=item.get('link') if is_image else None, + thumbnail_url=image_data.get('thumbnailLink'), + image_width=image_data.get('width'), + image_height=image_data.get('height'), + context_link=image_data.get('contextLink') + )) + + return CSEResponse( + results=results, + total_results=search_info.get('totalResults', '0'), + search_time=float(search_info.get('searchTime', 0)), + query=query, + start_index=start, + is_image_search=is_image + ) + + except httpx.TimeoutException: + return CSEResponse( + results=[], + total_results='0', + search_time=0.0, + query=query, + start_index=start, + error=CSEError(code=408, message='Request timed out') + ) + except httpx.RequestError as e: + return CSEResponse( + results=[], + total_results='0', + search_time=0.0, + query=query, + start_index=start, + error=CSEError(code=500, message=f'Request failed: {str(e)}') + ) + except Exception as e: + return CSEResponse( + results=[], + total_results='0', + search_time=0.0, + query=query, + start_index=start, + error=CSEError(code=500, message=f'Unexpected error: {str(e)}') + ) + + def close(self): + """Close the HTTP client""" + self._client.close() + + def __enter__(self): + return self + + def __exit__(self, *args): + self.close() + + +def cse_results_to_html(response: CSEResponse, query: str) -> str: + """Convert CSE API response to HTML matching Whoogle's result format + + This generates HTML that mimics the structure expected by Whoogle's + existing filter and 
result processing pipeline. + + Args: + response: CSEResponse from the API + query: Original search query + + Returns: + HTML string formatted like Google search results + """ + if response.has_error: + error = response.error + if error.is_quota_exceeded: + return _error_html( + 'API Quota Exceeded', + 'Your Google Custom Search API quota has been exceeded. ' + 'Free tier allows 100 queries/day. Wait until midnight PT ' + 'or enable billing in Google Cloud Console.' + ) + elif error.is_invalid_key: + return _error_html( + 'Invalid API Key', + 'Your Google Custom Search API key is invalid. ' + 'Please check your API key and CSE ID in settings.' + ) + else: + return _error_html('Search Error', error.message) + + if not response.has_results: + return _no_results_html(query) + + # Use different HTML structure for image vs web results + if response.is_image_search: + return _image_results_html(response, query) + + # Build HTML results matching Whoogle's expected structure + results_html = [] + + for result in response.results: + # Escape HTML in content + title = _escape_html(result.title) + snippet = _escape_html(result.snippet) + link = result.link + display_link = _escape_html(result.display_link) + + # Use HTML versions if available (they have bold tags for query terms) + if result.html_title: + title = result.html_title + if result.html_snippet: + snippet = result.html_snippet + + # Match the structure used by Google/mock results + result_html = f''' +
+
+ +

{title}

+
{display_link}
+
+
+
+
+ {snippet} +
+
+
+ ''' + results_html.append(result_html) + + # Build pagination if needed + pagination_html = '' + if int(response.total_results) > 10: + pagination_html = _pagination_html(response.start_index, response.query) + + # Wrap in expected structure + # Add data-cse attribute to prevent collapse_sections from collapsing these results + return f''' + + +
+
+
+
+
+ +
+ {pagination_html} +
+
+
+
+ + + ''' + + +def _escape_html(text: str) -> str: + """Escape HTML special characters""" + if not text: + return '' + return (text + .replace('&', '&amp;') + .replace('<', '&lt;') + .replace('>', '&gt;') + .replace('"', '&quot;') + .replace("'", '&#39;')) + + +def _error_html(title: str, message: str) -> str: + """Generate error HTML""" + return f''' + + +
+
+

{_escape_html(title)}

+

{_escape_html(message)}

+
+
+ + + ''' + + +def _no_results_html(query: str) -> str: + """Generate no results HTML""" + return f''' + + +
+
+

No results found for {_escape_html(query)}

+
+
+ + + ''' + + +def _image_results_html(response: CSEResponse, query: str) -> str: + """Generate HTML for image search results using the imageresults template + + Args: + response: CSEResponse with image results + query: Original search query + + Returns: + HTML string formatted for image results display + """ + # Convert CSE results to the format expected by imageresults.html template + results = [] + for result in response.results: + image_url = result.image_url or result.link + thumbnail_url = result.thumbnail_url or image_url + web_page = result.context_link or result.link + domain = urlparse(web_page).netloc if web_page else result.display_link + + results.append({ + 'domain': domain, + 'img_url': image_url, + 'web_page': web_page, + 'img_tbn': thumbnail_url + }) + + # Build pagination link if needed + next_link = None + if int(response.total_results) > response.start_index + len(response.results) - 1: + next_start = response.start_index + 10 + next_link = f'search?q={query}&tbm=isch&start={next_start}' + + # Use the same template as regular image results + return render_template( + 'imageresults.html', + length=len(results), + results=results, + view_label="View Image", + next_link=next_link + ) + + +def _pagination_html(current_start: int, query: str) -> str: + """Generate pagination links""" + # CSE API uses 1-based indexing, 10 results per page + current_page = (current_start - 1) // 10 + 1 + + prev_link = '' + next_link = '' + + if current_page > 1: + prev_start = (current_page - 2) * 10 + 1 + prev_link = f'Previous' + + next_start = current_page * 10 + 1 + next_link = f'Next' + + return f''' + + ''' diff --git a/app/templates/index.html b/app/templates/index.html index c6fe19e..9425fc1 100644 --- a/app/templates/index.html +++ b/app/templates/index.html @@ -257,6 +257,30 @@ + +
+ Google Custom Search (BYOK) +
Setup Guide
+
+
+ + +
— Enable to use your own Google API key (100 free queries/day)
+
+
+ + +
+
+ + +
diff --git a/app/utils/search.py b/app/utils/search.py index 771f594..a4b9f5e 100644 --- a/app/utils/search.py +++ b/app/utils/search.py @@ -5,6 +5,7 @@ from app.filter import Filter from app.request import gen_query from app.utils.misc import get_proxy_host_url from app.utils.results import get_first_link +from app.services.cse_client import CSEClient, cse_results_to_html from bs4 import BeautifulSoup as bsoup from cryptography.fernet import Fernet, InvalidToken from flask import g @@ -142,6 +143,89 @@ class Search: config=self.config, query=self.query, page_url=self.request.url) + + # Check if CSE (Custom Search Engine) should be used + use_cse = ( + self.config.use_cse and + self.config.cse_api_key and + self.config.cse_id + ) + + if use_cse: + # Use Google Custom Search API + return self._generate_cse_response(content_filter, root_url, mobile) + + # Default: Use traditional scraping method + return self._generate_scrape_response(content_filter, root_url, mobile) + + def _generate_cse_response(self, content_filter: Filter, root_url: str, mobile: bool) -> str: + """Generate response using Google Custom Search API + + Args: + content_filter: Filter instance for processing results + root_url: Root URL of the instance + mobile: Whether this is a mobile request + + Returns: + str: HTML response string + """ + # Get pagination start index from request params + start = int(self.request_params.get('start', 1)) + + # Determine safe search setting + safe = 'high' if self.config.safe else 'off' + + # Determine search type (web or image) + # tbm=isch or udm=2 indicates image search + search_type = '' + if self.search_type == 'isch' or self.request_params.get('udm') == '2': + search_type = 'image' + + # Create CSE client and perform search + with CSEClient( + api_key=self.config.cse_api_key, + cse_id=self.config.cse_id + ) as client: + response = client.search( + query=self.query, + start=start, + safe=safe, + language=self.config.lang_search, + country=self.config.country, 
+ search_type=search_type + ) + + # Convert CSE response to HTML + html_content = cse_results_to_html(response, self.query) + + # Store full query for tabs + self.full_query = self.query + + # Parse and filter the HTML + html_soup = bsoup(html_content, 'html.parser') + + # Handle feeling lucky + if self.feeling_lucky: + if response.has_results and response.results: + return response.results[0].link + self.feeling_lucky = False + + # Apply content filter (encrypts links, applies CSS, etc.) + formatted_results = content_filter.clean(html_soup) + + return str(formatted_results) + + def _generate_scrape_response(self, content_filter: Filter, root_url: str, mobile: bool) -> str: + """Generate response using traditional HTML scraping + + Args: + content_filter: Filter instance for processing results + root_url: Root URL of the instance + mobile: Whether this is a mobile request + + Returns: + str: HTML response string + """ full_query = gen_query(self.query, self.request_params, self.config) diff --git a/requirements.txt b/requirements.txt index fae9740..a555a1a 100644 --- a/requirements.txt +++ b/requirements.txt @@ -30,5 +30,5 @@ h11>=0.16.0 validators==0.35.0 waitress==3.0.2 wcwidth==0.2.14 -Werkzeug==3.1.3 +Werkzeug==3.1.4 python-dotenv==1.1.1 diff --git a/whoogle.template.env b/whoogle.template.env index ee2a502..2100f84 100644 --- a/whoogle.template.env +++ b/whoogle.template.env @@ -72,9 +72,6 @@ # Remove everything except basic result cards from all search queries #WHOOGLE_MINIMAL=0 -# Set the number of results per page -#WHOOGLE_RESULTS_PER_PAGE=10 - # Controls visibility of autocomplete/search suggestions #WHOOGLE_AUTOCOMPLETE=1
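
Review note: `_pagination_html` and `_image_results_html` in `app/services/cse_client.py` derive page offsets from the API's 1-based `start` index and interpolate the raw query into the link. The arithmetic can be sketched as below, assuming 10 results per page as in the diff. `page_starts` and `build_search_link` are hypothetical helper names that are not part of this change, and encoding the query with `urlencode` is a suggestion, not what the diff currently does.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # the diff caps `num` at 10 per request


def page_starts(current_start: int) -> Tuple[Optional[int], int]:
    """Return (previous_start, next_start) for a 1-based start index.

    Mirrors the arithmetic in `_pagination_html`; previous_start is None
    on the first page.
    """
    current_page = (current_start - 1) // RESULTS_PER_PAGE + 1
    prev_start = None
    if current_page > 1:
        prev_start = (current_page - 2) * RESULTS_PER_PAGE + 1
    next_start = current_page * RESULTS_PER_PAGE + 1
    return prev_start, next_start


def build_search_link(query: str, start: int, image: bool = False) -> str:
    """Hypothetical helper: relative search link with an encoded query."""
    params = {'q': query, 'start': start}
    if image:
        # tbm=isch is the image-search marker used elsewhere in the diff
        params['tbm'] = 'isch'
    return 'search?' + urlencode(params)


if __name__ == '__main__':
    print(page_starts(11))
    print(build_search_link('c++ & rust', 11, image=True))
```

Encoding matters for queries that contain `&`, `+`, or `#`: interpolated raw, they would truncate or change the `q` parameter in the generated link.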