⚙️ Convert HTML to Markdown. Even works with entire websites and can be extended through rules.

Last update: Jan 6, 2023

Comments: 17

html-to-markdown

Convert HTML into Markdown with Go. It uses an HTML parser to avoid regular expressions as much as possible. That should prevent some weird edge cases and allows it to be used where the input is totally unknown.

Installation

go get github.com/JohannesKaufmann/html-to-markdown

Usage

import md "github.com/JohannesKaufmann/html-to-markdown"

converter := md.NewConverter("", true, nil)

html := `<strong>Important</strong>`

markdown, err := converter.ConvertString(html)
if err != nil {
  log.Fatal(err)
}
fmt.Println("md ->", markdown)

If you are already using goquery you can pass a selection to Convert.

markdown, err := converter.Convert(selec)

Using it on the command line

If you want to use html-to-markdown on the command line without writing any Go, check out html2md, a CLI wrapper for html-to-markdown that has all the following options and plugins built in.

Options

The third parameter to md.NewConverter is *md.Options.

For example, you can change the delimiter placed around bold text ("**") to a different one (for example "__") by changing the value of StrongDelimiter.

opt := &md.Options{
  StrongDelimiter: "__", // default: **
  // ...
}
converter := md.NewConverter("", true, opt)

For all the possible options, see the godocs. For a working example, see the options example in the repository.

Adding Rules

converter.AddRules(
  md.Rule{
    Filter: []string{"del", "s", "strike"},
    Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
      // You need to return a pointer to a string (md.String is just a helper function).
      // If you return nil the next function for that html element
      // will be picked. For example you could only convert an element
      // if it has a certain class name and fallback if not.
      content = strings.TrimSpace(content)
      return md.String("~" + content + "~")
    },
  },
  // more rules
)

For more information have a look at the example add_rules.
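The fallback behaviour described in the rule comment (return nil and the next function is tried) can be pictured with a toy model. This is only a sketch of the idea, not the library's implementation: the `replacement` type, `applyRules`, `strikeRules`, and the `class` parameter are hypothetical names invented for illustration, and the order in which the real converter tries rules is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// replacement is a hypothetical stand-in for a rule's Replacement function.
// Returning nil means "not handled, try the next function".
type replacement func(content, class string) *string

// str plays the same role as the md.String helper: it returns a pointer to s.
func str(s string) *string { return &s }

// applyRules tries each replacement in slice order and returns the first
// non-nil result. If every function declines, content is returned unchanged.
func applyRules(rules []replacement, content, class string) string {
	for _, r := range rules {
		if out := r(content, class); out != nil {
			return *out
		}
	}
	return content
}

// strikeRules builds two functions for the same element: the first only
// handles elements with the class "removed", the second is a generic fallback.
func strikeRules() []replacement {
	return []replacement{
		func(content, class string) *string {
			if class != "removed" {
				return nil // decline, so the fallback below is used
			}
			return str("~~" + strings.TrimSpace(content) + "~~")
		},
		func(content, class string) *string {
			return str("~" + strings.TrimSpace(content) + "~")
		},
	}
}

func main() {
	rules := strikeRules()
	fmt.Println(applyRules(rules, " old text ", "removed"))
	fmt.Println(applyRules(rules, " old text ", ""))
}
```

In a real rule, the class check would be done on the `*goquery.Selection` passed to Replacement rather than on a plain string.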

Using Plugins

If you want plugins (GitHub Flavored Markdown features like strikethrough, tables, ...), you can pass them to Use.

import "github.com/JohannesKaufmann/html-to-markdown/plugin"

// Use the `GitHubFlavored` plugin from the `plugin` package.
converter.Use(plugin.GitHubFlavored())

Or, if you only want the Strikethrough plugin, pass just that one. You can change the delimiter that marks crossed-out text by setting the first argument to a different value (for example "~~" instead of "~").

converter.Use(plugin.Strikethrough(""))

For more information have a look at the example github_flavored.

Writing Plugins

Have a look at the plugin folder for a reference implementation. The most basic one is Strikethrough.

Other Methods

Godoc

`func (c *Converter) Keep(tags ...string) *Converter`

Determines which elements are to be kept and rendered as HTML.

`func (c *Converter) Remove(tags ...string) *Converter`

Determines which elements are to be removed altogether i.e. converted to an empty string.

Issues

If you find HTML snippets (or even full websites) that don't produce the expected results, please open an issue!

Related Projects

turndown (js), a very good library written in JavaScript.
lunny/html2md, which uses regex instead of goquery. I came across a few edge cases when using it (leaving some HTML comments, ...) so I wrote my own.

Owner

Johannes Kaufmann

Finance and Operations @ Code+Design & Software Engineering Student @ CODE

https://github.com/JohannesKaufmann/html-to-markdown

Comments

Mention wrapper program in README.md?

Hi @JohannesKaufmann

I love your project so much that I added a wrapper program to it:

$ html2md -i https://github.com/suntong/lang
[Homepage](https://github.com/)
. . . 


$ html2md -i https://github.com/suntong/lang -s 'div#readme'   
## README.md

# lang -- programming languages demos

Would it be OK if I open a PR against README.md to mention html2md when it is ready? So far I have these planned out:

$ html2md
HTML to Markdown
Version 0.1.0 built on 2020-07-26
Copyright (C) 2020, Tong Sun

HTML to Markdown converter on command line

Usage:
  html2md [Options...]

Options:

  -h, --help                       display help information 
  -i, --in                        *The html/xml file to read from (or stdin) 
  -d, --domain                     Domain of the web page, needed for links when --in is not url 
  -s, --sel                        CSS/goquery selectors [=body]
  -v, --verbose                    Verbose mode (Multiple -v options increase the verbosity.) 

      --opt-heading-style          Option HeadingStyle 
      --opt-horizontal-rule        Option HorizontalRule 
      --opt-bullet-list-marker     Option BulletListMarker 
      --opt-code-block-style       Option CodeBlockStyle 
      --opt-fence                  Option Fence 
      --opt-em-delimiter           Option EmDelimiter 
      --opt-strong-delimiter       Option StrongDelimiter 
      --opt-link-style             Option LinkStyle 
      --opt-link-reference-style   Option LinkReferenceStyle 

  -A, --plugin-conf-attachment     Plugin ConfluenceAttachments 
  -C, --plugin-conf-code           Plugin ConfluenceCodeBlock 
  -F, --plugin-frontmatter         Plugin FrontMatter 
  -G, --plugin-gfm                 Plugin GitHubFlavored 
  -S, --plugin-strikethrough       Plugin Strikethrough 
  -T, --plugin-table               Plugin Table 
  -L, --plugin-task-list           Plugin TaskListItems 
  -V, --plugin-vimeo               Plugin VimeoEmbed 
  -Y, --plugin-youtube             Plugin YoutubeEmbed

Thanks

New confluence code block parser plugin

Hi @JohannesKaufmann

I have the pleasure of working with this library. I had to add a Confluence page parser to parse out code blocks, and I thought I'd contribute it back if you like the plugin and think the changes are appropriate.

Thanks for your great work on this! :)

html <br> not supported.

var html =`
<p>1. xxx <br/>2. xxxx<br/>3. xxx</p><p><span class="img-wrap"><img src="xxx"></span><br>4. golang<br>a. xx<br>b. xx</p>
`

func Test_md(t *testing.T) {
	var converter = md.NewConverter("", true, nil)
	md_str,_ := converter.ConvertString(html)
	println(md_str)
}

output

1\. xxx 2\. xxxx3\. xxx

![](xxx)4\. golanga. xxb. xx

want

1. xxx 
2. xxxx
3. xxx

![](xxx)
4. golang
a. xx
b. xx

Unexpected result with additional rule for custom self-closing tags

I was following this example to write a rule to process custom <mention> tags in my input: https://github.com/JohannesKaufmann/html-to-markdown/blob/master/examples/custom_tag/main.go

The result was quite surprising; however, I'm not sure whether this is a bug, misuse, or a limitation of the library.

Code:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `
	test
	
	<mention user="user1" />
	<mention user="user2" />
	<mention user="user3" />

	blabla
	`

	rule := md.Rule{
		Filter: []string{"mention"},
		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"

			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}

			return &result
		},
	}

	conv := md.NewConverter("", true, nil)
	conv.AddRules(rule)

	markdown, err := conv.ConvertString(html)
	if err != nil {
		log.Fatalln(err)
	}

	fmt.Println("markdown:\n", markdown)
}

Expected output:

markdown:
 test
	
 @user1
 @user2
 @user3

 blabla

Observed output:

markdown:
 test

 @user1

Moreover, if I add these lines to debug what is going on in the Replacement calls, it gets even weirder:

		Replacement: func(content string, selec *goquery.Selection, opt *md.Options) *string {
			result := "@"

			u, ok := selec.Attr("user")
			if ok {
				result += u
			} else {
				result += "unknown"
			}

			html, err := selec.Html()
			if err != nil {
				log.Fatalln(err)
			}

			fmt.Println("content:", content)
			fmt.Println("selec:", html)
			fmt.Println("result:", result)

			return &result
		},

Output:

content: 

 blabla  

selec:

        blabla 

result: @user3 
content: @user3
selec:
        <mention user="user3">

        blabla
        </mention>
result: @user2
content: @user2
selec:
        <mention user="user2">
        <mention user="user3">

        blabla
        </mention></mention>
result: @user1
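The debug output shows each mention element containing the ones that follow it, which is consistent with how HTML5 parsing treats `/>` on non-void elements: the slash is ignored and the tag stays open. One possible workaround, offered here as an assumption rather than a documented feature of the library, is to expand the self-closing form into an explicit open/close pair before calling ConvertString. The helper name `expandSelfClosing` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// expandSelfClosing rewrites <tag ... /> into <tag ...></tag> for one custom
// tag name, so an HTML parser no longer treats the element as left open.
// Hypothetical pre-processing helper; it is not part of html-to-markdown.
func expandSelfClosing(html, tag string) string {
	// Group 1 captures the attributes; \b keeps "mention" from matching "mentions".
	re := regexp.MustCompile(`<` + regexp.QuoteMeta(tag) + `\b([^>]*?)\s*/>`)
	return re.ReplaceAllString(html, "<"+tag+"${1}></"+tag+">")
}

func main() {
	in := "test\n<mention user=\"user1\" />\n<mention user=\"user2\" />\nblabla"
	fmt.Println(expandSelfClosing(in, "mention"))
}
```

The result can then be passed to `conv.ConvertString` in place of the raw input. A regular expression is a blunt tool for HTML, so this only suits inputs where the custom tag is written in a predictable way.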

Nested lists aren't converted correctly

Describe the bug I'm seeing a problem converting nested HTML lists. The problem appears with either ordered (<ol>) or unordered (<ul>) lists.

HTML Input

<ol>
	<li>One</li>
	<ol>
		<li>One point one</li>
		<li>One point two</li>
	</ol>
</ol>

Generated Markdown

1. One

1. One point one
2. One point two

Expected Markdown

1. One
    1. One point one
    2. One point two

Additional context I see this with the latest version (1.2.1). I'm using the following test code to check this:

package main

import (
	"fmt"
	"log"

	md "github.com/JohannesKaufmann/html-to-markdown"
)

func main() {
	converter := md.NewConverter("", true, nil)

	html := `
<ol>
	<li>One</li>
	<ol>
		<li>One point one</li>
		<li>One point two</li>
	</ol>
</ol>
`

	markdown, err := converter.ConvertString(html)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("md ->\n%s\n", markdown)
}

Thanks for the library!
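To make the expected shape concrete, here is a toy renderer that produces the nesting shown under Expected Markdown: numbering restarts at each level and children are indented by four spaces. It is only an illustration of the target output, not the library's list handling; the `item` type and `renderOrdered` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// item is a hypothetical list node: its own text plus any nested items.
type item struct {
	text     string
	children []item
}

// renderOrdered writes an ordered list, indenting each nesting level by four
// spaces and restarting the numbering for every nested list.
func renderOrdered(items []item, depth int) string {
	var b strings.Builder
	indent := strings.Repeat("    ", depth)
	for i, it := range items {
		fmt.Fprintf(&b, "%s%d. %s\n", indent, i+1, it.text)
		b.WriteString(renderOrdered(it.children, depth+1))
	}
	return b.String()
}

func main() {
	list := []item{
		{text: "One", children: []item{
			{text: "One point one"},
			{text: "One point two"},
		}},
	}
	fmt.Print(renderOrdered(list, 0))
}
```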

Broken output with new lines between tags

The problem may appear in a wider range of cases, but what I've found so far is the following:

There are text posts with links to videos in specific tags

<video>https://youtu.be/SoMeViD</video>\r\n<video>https://youtu.be/SoMeViD</video>

html-to-markdown doesn't understand them, which is absolutely fine; I just want it to leave them for further processing. When there is only one, or they are separated by other elements, there is no problem at all and everything works perfectly. However, when there are two or more, it results in:

https://youtu.be/BpDqa2K0hvIhttps://youtu.be/GfE2D62bMTE

Or, if I wanted to make a regular link from it, or embed in iframe I would get this: https://youtu.be/BpDqa2K0hvIhttps://youtu.be/GfE2D62bMTE

I think in such a case separators between tags, such as , \t,  , \n, or \r\n should be kept.
🐛 Bug with square brackets
Describe the bug

Found an issue with square brackets in the input which is confusing me. They end up being converted to \$& in the output. This seems to happen whether they are written in the html as [], [, or [.

HTML Input

first [literal] brackets then [one] way to escape then [another] one

Generated Markdown

first \$&literal\$& brackets then \$&one\$& way to escape then \$&another\$& one

Expected Markdown

first \[literal\] brackets then [one] way to escape then [another] one

Additional context

I had this issue come up with some options configured, but then went ahead and removed all configuration to test and I'm still seeing it. Is it something on my end I'm doing incorrectly perhaps? I'm not very experienced with golang so it's possible I'm making a silly error.
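The literal `\$&` in the output looks like a JavaScript-style replacement placeholder that was never expanded. Go's regexp package does not understand `$&`; the whole match is written `${0}`. Whether that is what happens inside the library is an assumption, since its source is not shown here, but the standard library alone reproduces the reported output. The function names below are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// brackets matches a single literal "[" or "]".
var brackets = regexp.MustCompile(`[\[\]]`)

// jsStyle uses the JavaScript-style "$&" placeholder. Go treats it as
// malformed and copies it to the output as raw text.
func jsStyle(s string) string {
	return brackets.ReplaceAllString(s, `\$&`)
}

// goStyle uses "${0}", Go's placeholder for the whole match, so each bracket
// is prefixed with a backslash.
func goStyle(s string) string {
	return brackets.ReplaceAllString(s, `\${0}`)
}

func main() {
	in := "first [literal] brackets"
	fmt.Println(jsStyle(in))
	fmt.Println(goStyle(in))
}
```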

🐛 Bug Can not handle img

Describe the bug A clear and concise description of what the bug is.

HTML Input

<figure><img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png"><figcaption></figcaption></figure>

Generated Markdown

<img class="lazyload inited loaded" data-src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png" data-width="800" data-height="600" src="https://i.loli.net/2020/08/13/cVomW7L9YOTw2uA.png">

Expected Markdown

nothing

🐛 Bug: Support `<tt>` for code next to `<code>` tags

Describe the bug

Unfortunately, some sites don't use semantic markup, e.g., http://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/ but instead specify the font directly using tt. Since Markdown draws no distinction between code and things simply formatted in "typewriter style", these should be recognized as well (or, at least, as a plugin).

HTML Input

<tt>Some typewriter text</tt>

Generated Markdown

Some typewriter text

Expected Markdown

`Some typewriter text`

Additional context

N/A

Extra elements in blocks

Some websites use <code> blocks with elements inside. It seems to be the case when the syntax highlighting is computed server-side, rather than in the browser with some JS library such as prettify.

To reproduce:

func main() {
	converter := md.NewConverter("", true, nil)
	url := "https://atomizedobjects.com/blog/javascript/how-to-get-the-last-segment-of-a-url-in-javascript"
	markdown, _ := converter.ConvertURL(url)
	fmt.Println(markdown)
}

What I get (scrolling down a bit):

``js
window.location.pathname.split("/").filter(entry => entry !== "");
// ["blog", "javascript", "how-to-get-the-last-segment-of-a-url-in-javascript"]
``

What you get if you just remove all elements from the generated markdown:

window.location.pathname.split("/").filter(entry => entry !== "");
// ["blog", "javascript", "how-to-get-the-last-segment-of-a-url-in-javascript"]

I know that an easy workaround on my side would be to just clean things up with goquery, but I figured it would be better to have it fixed here directly. Thanks!

🐛 `&lt;` and `&gt;` should not be converted to `<` and `>`

Describe the bug

&lt; and &gt; should not be converted to < and >; it breaks the resulting markdown.

HTML Input

&lt;not a tag&gt;

Generated Markdown

<not a tag>

Expected Markdown

&lt;not a tag&gt;

Additional context

Markdown parsers take <not a tag> as a tag and do not show it. That's not what is in the HTML though. Example: https://spec.commonmark.org/dingus/?text=%3Cnot%20a%20tag%3E%0A%0A%26lt%3Bsecond%26gt%3B

Bump github.com/yuin/goldmark from 1.4.14 to 1.5.3

Bumps github.com/yuin/goldmark from 1.4.14 to 1.5.3.

Commits

a87c577 Fix #333
aaeb985 Merge pull request #332 from stefanfritsch/feature-fences
34756f2 Add link to goldmark-fences
c71a97b Fixed bug related newline code
ae42b91 Fix bug that escaped space not working with Linkfy extension
1dd67d5 Add CJK extension
95efaa1 Merge pull request #324 from soypat/patch-1
a3630e3 Fix #323
3923ba0 Add goldmark-latex extension to README.md
See full diff in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Potential issue in the Table plugin with the isFirstTbody logic

Hello, in the table.go plugin there's an issue with the firstSibling logic in the isFirstTbody function.

func isFirstTbody(s *goquery.Selection) bool {
	firstSibling := s.Siblings().Eq(0) // TODO: previousSibling
	if s.Is("tbody") && firstSibling.Length() == 0 {
		return true
	}
	return false
}

I'm retrieving tables from Confluence HTML in the tbody-tr-th format. Somehow firstSibling.Length() is not 0. I haven't figured it out completely, but when I comment that check out it seems to do what it's supposed to do, although that might introduce a new bug :).

github.com/JohannesKaufmann/html-to-markdown v1.3.6
github.com/PuerkitoBio/goquery v1.8.0

🐛 Bug: Support MathJax custom tags

Describe the bug

MathJax is a JavaScript library that allows adding "custom tags" such as $...$ to HTML, which are then turned into, e.g., MathML or whatever the browser supports. Depending on the Markdown implementation, math is either not supported at all or supported directly through the same syntax. Either way, it'd probably make most sense to simply keep $...$ expressions intact and not escape strings contained therein. While a simple filter for that would certainly work, MathJax allows delimiters other than $...$ for inline math and $$...$$ for display math, e.g., from the article https://math.andrej.com/2007/09/28/seemingly-impossible-functional-programs/:

<script>
window.MathJax = {
  tex: {
    tags: "ams",
    inlineMath: [ ['$','$'], ['\$', '\$'] ],
    displayMath: [ ['$$','$$'] ],
    processEscapes: true,
  },
  options: {
    skipHtmlTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
  },
  loader: {
    load: ['[tex]/amscd']
  }
};
</script>

This would necessitate parsing JS though ...

HTML Input

some formula: $\lambda$

Generated Markdown

some formula: $\\lambda$

Expected Markdown

some formula: $\lambda$

Additional context

This filter (or "unfilter") could be activated only if MathJax is detected, and disabled otherwise. Further, as mentioned earlier, a more sophisticated parsing of the HTML could be used to detect the precise math delimiters used, or at least make them configurable.

🐛 Bug <br> is converted into two new lines (\n\n)

Describe the bug

In my testing I've found that the HTML tag <br> gets turned into two new lines (\n\n). Example:

(⎈ |local:default) prologic@Jamess-iMac Mon Aug 02 11:37:55 ~/tmp/html2md (master) 130
$ ./html2md -i
Hello<br>World
Hello

World

HTML Input

Hello<br>World

Generated Markdown

Hello

World

Expected Markdown

Hello
World

Additional context

Is there any way to control this behaviour? I get that this might be getting interpreted as a "paragraph", but I would only expect that if there are two <br>(s) or an actual paragraph ....

Thanks!

Spacing & numbering issues with nested lists

Describe the bug

I see a couple of issues with nested lists. One issue is that there are extra line breaks between list items in nested lists. When I render this in my application, it wraps text with a <p> if there's an extra line break (which has implications for margin/padding). Another (small) issue I see is that numbering gets off for numbered lists. I realize this doesn't matter with Markdown, but I thought I'd note it.

HTML Input

The Corinthos Center for Cancer will be partially closed for remodeling starting 4/15/21. Patients should be redirected as space permits in the following order:
<ol>
  <li>Metro Court West.</li>
  <li>Richie General.</li>
  <ol>
    <li>This place is ok.</li>
    <li>Watch out for the doctors.</li>
    <ol>
      <li>They bite.</li>
      <li>But not hard.</li>
    </ol>
  </ol>
  <li>Port Charles Main.</li>
</ol>
For further information about appointment changes, contact:
<ul>
  <li>Dorothy Hardy</li>
  <ul>
    <li>Head of Operations</li>
    <ul>
      <li>Interim</li>
    </ul>
  </ul>
  <li>[email protected]</li>
  <li>555-555-5555</li>
</ul>
The remodel is <a href="http://www.google.com/" target="_self">expected</a > to complete in June 2021. Timeframe subject to change.

Generated Markdown

The Corinthos Center for Cancer will be partially closed for remodeling starting **4/15/21**. Patients should be redirected as space permits in the following order:

1. Metro Court West.
2. Richie General.

    1. This place is ok.
    2. Watch out for the doctors.
        1. They bite.
        2. But not hard.

4. Port Charles Main.

For further information about appointment changes, contact:

- Dorothy Hardy

    - _Head of Operations_
        - _Interim_

- [email protected]
- 555-555-5555

_The remodel is_ [_expected_](http://www.google.com/) _to complete in June 2021._ **_Timeframe subject to change_** _._

Note how there are extra line breaks after "2. Richie General.", " 2. But not hard.", "- Dorothy Hardy", and " - Interim". Also note how "4. Port Charles Main." should be "3. Port Charles Main.".

Expected Markdown

The Corinthos Center for Cancer will be partially closed for remodeling starting **4/15/21**. Patients should be redirected as space permits in the following order:

1. Metro Court West.
2. Richie General.
    1. This place is ok.
    2. Watch out for the doctors.
        1. They bite.
        2. But not hard.
3. Port Charles Main.

For further information about appointment changes, contact:

- Dorothy Hardy
    - _Head of Operations_
        - _Interim_
- [email protected]
- 555-555-5555

_The remodel is_ [_expected_](http://www.google.com/) _to complete in June 2021._ **_Timeframe subject to change_** _._

Additional context

I see this with the latest version (1.3.0). I'm using no plugins.

Thanks for the utility!

Is `Converter` safe for use by multiple goroutines?

This should be documented. Is it safe for use by multiple goroutines? Am I expected to use one single instance of Converter with the same configuration across my app, or to create a new one in each case? What's the design, and what are the performance considerations?

PS: there is a sync.RWMutex within the Converter struct, so the answer is probably yes, but, again, this should be documented so that users don't have to guess or reverse engineer it.
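Until the concurrency guarantees are documented, one conservative pattern sidesteps the question by giving every goroutine its own converter. The sketch below is self-contained and makes several assumptions: `stringConverter` is a hypothetical interface that only mirrors the `ConvertString(html) (string, error)` signature shown under Usage, `stubConverter` is a placeholder so the example runs without the library, and `convertAll` is an invented helper. It says nothing about whether sharing one Converter is actually safe.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// stringConverter is a hypothetical interface matching the ConvertString
// signature used in the Usage section.
type stringConverter interface {
	ConvertString(html string) (string, error)
}

// stubConverter is a placeholder implementation so this sketch runs on its
// own. It only handles <strong> and is not the library's logic.
type stubConverter struct{}

func (stubConverter) ConvertString(html string) (string, error) {
	r := strings.NewReplacer("<strong>", "**", "</strong>", "**")
	return r.Replace(html), nil
}

// convertAll converts every input in its own goroutine. Each goroutine
// creates a private converter via newConv, so no converter is shared, and
// each goroutine writes only to its own slot of the result slices.
func convertAll(newConv func() stringConverter, inputs []string) ([]string, error) {
	out := make([]string, len(inputs))
	errs := make([]error, len(inputs))
	var wg sync.WaitGroup
	for i, in := range inputs {
		wg.Add(1)
		go func(i int, in string) {
			defer wg.Done()
			conv := newConv()
			out[i], errs[i] = conv.ConvertString(in)
		}(i, in)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return out, nil
}

func main() {
	newConv := func() stringConverter { return stubConverter{} }
	res, err := convertAll(newConv, []string{"<strong>a</strong>", "b"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(res)
}
```

With the real library, `newConv` would presumably return `md.NewConverter("", true, nil)`. Creating a converter per goroutine trades some setup cost for not having to rely on undocumented behaviour.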


